Part B: Lunar Lander Reinforcement Learning Analysis¶

Our Team:

  • Member 1: Bey Wee Loon - p2112802
  • Member 2: Quah Johnnie - p2007476

1.0 Introduction to Reinforcement Learning¶

What is Reinforcement Learning?

Reinforcement Learning (RL) is a subset of machine learning that focuses on how an agent interacts with its environment to achieve a specific goal by taking a sequence of actions that maximizes a numerical reward signal. The process of RL involves balancing the exploration and exploitation of different actions. Exploration refers to the agent choosing an action that it has not tried before, while exploitation involves following a known action that has previously led to a positive outcome.

Examples application of Reinforcement Learning:

Trading: RL can be used to make trading decisions by training an agent to choose the best actions based on historical market data and other relevant factors. The agent, in this case, would be the trading algorithm while the environment would include the market conditions and other traders. RL can help in finding the optimal trading strategies and predicting the market trends.

Video Games: Reinforcement learning can also be used to create intelligent game playing agents. The agent in this case would learn to play a game by exploring different strategies, receiving rewards and penalties based on the outcome of its actions and adapting its behavior accordingly. For example, a RL-based agent playing chess would learn to make the best moves by trying out different strategies, receiving rewards for winning and penalties for losing, and updating its decision-making policy based on this experience.

Robotics: Reinforcement learning can be used to control the behavior of robots. The agent can learn to perform tasks by exploring different actions, receiving rewards or penalties based on the outcome and updating its behavior accordingly. For example, a RL-based robot can be trained to navigate through a maze or pick and place objects.

Control Systems: Reinforcement learning can be applied to control systems to optimize their performance. The agent can learn to control the system by trying out different actions, receiving rewards or penalties based on the system's performance, and updating its behavior accordingly. For example, a RL-based control system can be used to optimize the energy consumption of a

Notebook Table of Contents¶


Emoji Legend:
📙 Main Heading
📖 Subheading
🤓 Research/Discussion
🤖 RL code/modeling/training
🔬 RL model evaluation/analysis
⚙️ Config/utility code (Found throughout notebook)


Headings & subheadings in content table are clickable (Please use them 🙂 - This report may be long)

Description Headings

📙 Imports & Configuration

2.0

📖 Libraries

2.1

📖🤓 Environment Background Research

2.2

📙🤓 Deep Q-learning

3.0

📖🤖 Deep Q-learning Network - (DQN)

3.1

📖🤓🤖 Deep Q-learning Network Agent

3.2

📖🤓🤖 Experience Replay

3.3

📖🤓 Epsilon Greedy

3.5

📖🤖 Deep Q-learning Network Training

3.6

📖🔬 Evaluate DQN Performance

3.7

📖🤓 Double Deep Q-Learning Network - (DDQN)

4.0

📖🤖 Double Deep Q-Learning Network Modelling

4.1

📖🤖 Double Deep Q-Learning Network Training

4.2

📖🔬 Evaluate DDQN Performance

4.3

📙🤓 Actor and Critic /w PPO

5.0

📖🤖 Actor and Critic Network Modelling

5.1

📖🤖 Training Algorithm Code

5.3

📖🤖 Training /w Actor & Critic (PPO)

5.4

📖🔬 Evaluate Actor & Critic /w PPO Performance

5.5

📙🤓 DQN + Prioritized Experience Replay - Improving Our Best Candidate

6.0

📖🤖 Adding PER into our DQN Agent

6.1

📖🤖 Training DQN + PER

6.2

📖🔬 Evaluate DQN + PER Performance

6.3

📙🤓 Hyperparameter Tuning - 5 Hyperparameters

7.0

📖🤖 Running Hyperparameter Tuner

7.1

📖🔬 Evaluate Hyperparameter Tuned DQN + PER

7.2

📙 Final Evaluation - Objective Testing

8.0

📖🤖 Training All Models (1000 Episodes)

8.1

📖🔬 Testing All Models (500 Episodes)

8.2

📖🔬 Final Evaluation Of Test Results

8.3

2.0 Importing & Configurations¶

In [ ]:
# !pip install swig
# !pip install gym[box2d]

2.1 Libraries¶


  • Back to content table
In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as distributions
import torch.optim as optim
import base64, io, os
from copy import deepcopy
from tqdm.auto import tqdm
from ipywidgets import Output, GridspecLayout, Layout
from IPython.display import clear_output
from IPython import display


import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.lines import Line2D

import numpy as np
import pandas as pd
import random, json, itertools, time
from collections import deque, namedtuple

# For visualization
import gym
from gym.wrappers.monitoring import video_recorder
from IPython.display import HTML
from IPython import display 
import glob

device = torch.device("cpu" if torch.cuda.is_available() else "cpu")

2.2 Environment Background Research 🤓¶

OpenAI GYM

OpenAI GYM is a toolkit for developing and comparing reinforcement learning algorithms. It has a library that contains a variety of reinforcement learning tasks, including trading simulations. These environments have a shared and easy-to-use interface, allowing researchers and developers to train and compare agents using different algorithms.

OpenAI Gym Environment: LunarLander-v2

OpenAI Lunar Lander Gym Environment

Lunar Lander is one of OpenAI's gym environments, where the agent is a lunar lander that tries to land on a landing pad at coordinate (0,0). These coordinates are the first 2 numbers in the state vector. Also, this environment is created by Oleg Klimov.

There are a total of 4 discrete actions that the lander can take:

  1. Do nothing
  2. Fire left orientation engine
  3. Fire main engine
  4. Fire right orientation engine

There are a total of 8 observation spaces:

  1. The coordinates of the lander in x
  2. The coordinates of the lander in y
  3. Its linear velocities in x
  4. Its linear velocities in y
  5. Its angle
  6. Its angular velocity
  7. If left leg is in contact with the ground
  8. If right leg is in contact with the ground

Reward system of Lunar Lander:

  • Moving from the top of the screen to the landing pad at zero speed can range from 100 to 140 points.
  • If the lander moves away from the landing pad, it loses back the landing points.
  • The episode finishes if the lander crashes (receives -100 points) or comes to a rest (receives 100 points).
  • Each leg ground contact is an additional 10 points.
  • Firing main engine -0.3 reward per step.
  • Firing side engine -0.03 reward per step.

From the above, we can conclude that the best episode would be the one where the lander lands (+10 each leg contact) and comes to rest (+100 for rest) on the center of the landing pad at zero speed (+120 for landing) with the least steps (+220 - (count_main_engine_fire * 0.3 + count_side_engine_fire * 0.03)).

"enable_wind" is a parameter in the Lunar Lander v2 environment of OpenAI Gym. It determines whether wind is included in the simulation or not. When enable_wind is set to True, wind is included as a disturbance force acting on the lunar lander, adding an extra layer of difficulty to the task of landing. For our assignment here, we will be enabling the wind in the environment.

[References: OpenAI - LunarLander Documentation]


  • Back to content table

Note that env.seed() is already deprecated in the latest version of gym, to set seed just simply so np.random.seed()

  • To increase the difficulty and also explore more complex algorithms, we have enabled wind to use a harder and more complex environment
In [2]:
env = gym.make('LunarLander-v2',enable_wind=True)
np.random.seed(0)
print('State shape: ', env.observation_space.shape)
print('Number of actions: ', env.action_space.n)
State shape:  (8,)
Number of actions:  4

3.0 Deep Q-learning 🤓¶

Q Learning

Q Learning builds a Q-table of state-action values, with the number of states and actions being the dimension of the table. This table maps state and action pairs to a Q-value. The disadvantage for this method is that in real-world scenario, the table can become very large and becomes difficulty to manage.

Q Function

The Q-function in reinforcement learning represents the expected cumulative discounted reward of taking a specific action in a given state and following a fixed policy thereafter. The formula for the Q-function is given by:

$$Q(s, a) = E[R_t + \gamma * R_{t+1} + \gamma^2 * R_{t+2} + ... | s_t = s, a_t = a ]$$

where:

Symbol Meaning
$s_t$ represents the state at time t
$a_t$ represents the action taken at time t
$R_t$ represents the reward at time t
$\gamma$ represents the discount factor, which determines the importance of future rewards relative to immediate rewards (0 < $\gamma$ $\leq$ 1)

The Q-function estimates the expected cumulative discounted reward of taking action "a" in state "s" and following a fixed policy thereafter. It is used to determine the optimal policy, which is the policy that selects the action that maximizes the Q-value for each state.

Deep Q Learning

To address this issue, a Q function can be used instead of a Q table, this achieves the same result of the state-action pair mapping to the Q-value. Neural networks are excellent at modelling complex functions, we can use a neural network, Deep Q Learning (DQN) to approximate this Q function. This Q value can also be referred to as state-action function.

The Q network is the agent that is trained to produce the optimal state-action value. This current Q network is a fairly standard network architecture, containing a few linear layers. DQN architectures contains two neural networks, online network and target network.

Target Network

The target network has the exact same architecture as the online network. This network will not be trained and will only output predictions, these outputs are referred to as Target Q values.

Reason for this second network is to help stabilize training process. During training, the online network is trained on randomly intialized values each step. The issue with this is that the network's estimate can change rapidly during training, making the training process unstable. The target network is implemented to provide a more stable target for the network to learn from. The online network is trained to estimate the Q values of the target network, rather than its own estimates. This allows for a more stable training process

This target network is only updated periodically with the parameters of the online network. This is referred to as a soft update as it is not a complete copy, but rather an update with a smaller weight.

[References: Reinforcement Learning Explained Visually (Part 5): Deep Q Networks, step-by-step]


  • Back to content table

3.1 DQN Network 🤖¶


  • Back to content table
In [3]:
class QNetwork(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden_size=64):
        # Initialize the parameters and model
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_size) # First fully connected layer with state_dim inputs
        self.fc2 = nn.Linear(hidden_size, hidden_size) # Second fully connected layer
        self.fc3 = nn.Linear(hidden_size, action_dim) # Third fully connected layer with action_dim outputs
        
    def forward(self, state):
        # Build the network that maps state -> action values
        x = self.fc1(state)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        return self.fc3(x)

3.2 DQN Agent 🤓🤖¶

Agent

The agent is the entity that takes actions in the environment and receives feedback in the form of rewards. The agent's behavior is determined bythea Q-network. The DQN algorithm trains this network to output the expected cumulative reward for taking a specific action in a given state. The agent selects actions based on the outputs of the Q-network, and the network is updated based on the observed rewards.

Mean Squared Error

Mean squared error (MSE) will be used to compute the differences in the predicted reward and observed reward. The MSE loss function measures the average squared difference between the predicted values and the actual values, which provides a measure of the error in the predictions. Minimizing the MSE loss function with gradient descent leads to improved predictions, which leads to better performance in the reinforcement learning task.

Bellman Equation

The Bellman equation defines the relationship between the expected cumulative reward for a given state and action, and the expected cumulative reward for the next state that results from that action. It provides a way to recursively compute the optimal action-value function, which maps states and actions to expected cumulative rewards.


  • Back to content table
In [4]:
class Agent:
    # Interacts with and learns from the environment.
    def __init__(self, state_dim, action_dim, hidden_dim, network):
        '''
        Initialize an Agent object.
        
        Parameters
        ----------
            state_dim (int): Dimension of each state
            action_dim (int): Dimension of each action
        '''
        self.state_dim = state_dim
        self.action_dim = action_dim

        # Q-Network
        self.qnetwork_online = network(state_dim, action_dim, hidden_dim).to(device)
        self.qnetwork_target = network(state_dim, action_dim, hidden_dim).to(device)
        self.optimizer = optim.Adam(self.qnetwork_online.parameters(), lr=LR)

        # Replay memory
        self.memory = ReplayBuffer(action_dim, BUFFER_SIZE, BATCH_SIZE)
        # Initialize time step (for updating every UPDATE_EVERY steps)
        self.t_step = 0

    def step(self, state, action, reward, next_state, done):
        # Saves the experience in replay memory, and learns from it in specified intervals."
        self.memory.add(state, action, reward, next_state, done)
        
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0:
            if len(self.memory) > BATCH_SIZE:
                experiences = self.memory.sample()
                self.learn(experiences, GAMMA)

    def act(self, state, eps=0.):
        '''
        Returns actions for given state as per current policy.
        
        Parameters
        ----------
            state (array_like): Current state
            eps (float): Epsilon, for epsilon-greedy action selection
        
        Returns
        -------
            int: The selected action
        '''
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.qnetwork_online.eval()
        with torch.no_grad():
            action_values = self.qnetwork_online(state)
        self.qnetwork_online.train()

        # Epsilon-greedy action selection
        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_dim))

    def learn(self, experiences, gamma):
        '''
        Update value parameters using given batch of experience tuples.

        Parameters
        ----------
            experiences (Tuple[torch.Variable]): Tuple of (s, a, r, s', done) tuples 
            gamma (float): Discount factor
        '''
        # Obtain random minibatch of tuples from D
        states, actions, rewards, next_states, dones = experiences

        # Compute and minimize the loss
        # Extract next maximum estimated value from target network
        q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        # Calculate target value from bellman equation
        q_targets = rewards + gamma * q_targets_next * (1 - dones)
        # Calculate expected value from local network
        q_expected = self.qnetwork_online(states).gather(1, actions)
        
        # Loss calculation (we used Mean squared error)
        loss = F.mse_loss(q_expected, q_targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Update target network 
        self.soft_update(self.qnetwork_online, self.qnetwork_target, TAU)

    def soft_update(self, local_model, target_model, tau):
        '''
        Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target

        Parameters:
        ----------
            local_model (PyTorch model): weights will be copied from
            target_model (PyTorch model): weights will be copied to
            tau (float): interpolation parameter 
        '''
        # Copy weights of the local (online) network to the target network
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)

3.3 Experience Replay 🤓⚙️¶

The idea behind experience replay is to store the agent's experiences, represented as tuples of (state, action, reward, next_state), in a memory buffer, and then randomly sample these experiences to train the Q-network. Random selection is performed to ensure that the batch is shuffled and contains diversity from older and newer samples.

Neural network are typically trained on batches of data. If we were to train it with a single sample each iteration, the resulting gradient will have too much variance and the network weights will never converge. Not only can this technique stabilize the training process, it can also replay rare experiences, ones that are infrequent. By storing them in the memory, the agent can replay them multiple times and learn from them more effectively.


  • Back to content table
In [5]:
class ReplayBuffer:
    # A fixed-size container to store experience tuples.
    def __init__(self, action_dim, buffer_size, batch_size):
        '''
        Initialize the buffer object.

        Parameters
        ----------
            action_dim (int): dimension of each action
            buffer_size (int): maximum size of the buffer
            batch_size (int): size of each training batch
        '''
        self.action_dim = action_dim
        self.memory = deque(maxlen=buffer_size)  
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])

    def add(self, state, action, reward, next_state, done):
        # Add a new experience to the buffer.
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        # Select a random batch of experiences from the buffer.
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        # Return the current size of the buffer.
        return len(self.memory)

3.4 Save Videos ⚙️¶

Videos of the agent in an environment will be saved at specified intevals of episodes.


  • Back to content table
In [7]:
def show_video(file_name, width = 400):
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        mp4 = 'video/{}.mp4'.format(file_name)
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        display.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: {}px;">
                <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(width, encoded.decode('ascii'))))
    else:
        print("Could not find video")
        
def save_video(agent, file_name, model_ckpt= 'checkpoint_best.pth', max_t=1000,seed = 0):
    env = gym.make('LunarLander-v2',enable_wind=True, render_mode="rgb_array")
    vid = video_recorder.VideoRecorder(env, path="video/{}.mp4".format(file_name))
    agent.qnetwork_online.load_state_dict(torch.load('./models/' + model_ckpt))
    state = env.reset(seed = seed)[0]
    done = False
    t = 0
    rewards = 0
    while not done and t != max_t:
        t += 1
        frame = env.render()
        vid.capture_frame()
        
        action = agent.act(state)
        state, reward, done, _, _ = env.step(action)
        rewards += reward
        
    env.close()
    return rewards

3.5 Epsilon Greedy 🤓¶

The greedy epsilon is a method used to balance exploration and exploitation in the Q-learning process. The Q-value function estimates the value of taking a certain action in a given state. In Q-learning, the agent selects the action with the highest Q-value, which is known as the greedy action. However, this approach can lead to the agent getting stuck in a suboptimal solution if it only selects the greedy action.

The epsilon-greedy algorithm addresses this issue by introducing a probability epsilon of selecting a random action instead of the greedy action. This allows the agent to explore new actions and states, which can lead to finding better solutions.

In the algorithm, max_epsilon and min_epsilon will be defined. During the course of the training, the epsilon will decay overtime until it reaches the value of min_epsilon. Decreasing the epsilon helps encourage the algorithm to rely more on the values it has learned and less on random exploration.


  • Back to content table
In [7]:
n_episodes = 500
min_epsilon = 0.01
max_epsilon = 1.0
decay_rate = 1-0.995

# initialize epsilon values for greedy search
epsilon_array = np.zeros((n_episodes))
for i in range(n_episodes):
    epsilon = min_epsilon + (max_epsilon-min_epsilon)*np.exp(-decay_rate*i)
    epsilon_array[i] = epsilon

plt.plot(epsilon_array)
plt.show()

The value of epsilon decreases over time, allowing the agent to gradually shift from exploration to exploitation.

3.6 Training /w DQN🤖¶

The agent is trained for a maximum number of episodes (n_numbers), where each episode can run for a maximum of time steps (max_t). Time step controls how many time steps can be taken in each episode, the smaller it is, lesser time step the agent will be able to take to solve the environment in that episode.

The agent selects actions using an epsilon-greedy policy, where the value of epsilon starts from eps_start and gradually decreases to eps_end during training.

This function tracks and displays the average and current scores, episode lengths, and success and landing rates over the past 100 episodes (Simple Moving Average 100).

The current state of the agent's Q-network is saved every display_every episodes, and a video of the agent's performance is recorded and displayed every 2*display_every episodes. The best-performing agent is saved and its scores are displayed once the average score of the past 100 episodes reaches 200 or higher.


  • Back to content table

Code to train our agent with Deep Q-Learning Networks (includes DDQN) ⚙️

In [8]:
def train_agent(n_episodes: int=3000, max_t: int=1000, eps_start: float=1.0, 
        eps_end: float=0.01, eps_decay: float=0.995, display_every: int=150, model_name='DQN',
        video_filepath='LunarLander_training'):
    '''
    Train a Network agent
    
    Parameters:
        n_episodes (int): Maximum number of episodes for training
        max_t (int): Maximum number of timesteps per episode
        eps_start (float): Initial value of epsilon for epsilon-greedy action selection
        eps_end (float): Minimum value of epsilon
        eps_decay (float): Factor to decrease epsilon per episode
    '''
    scores = []                        # list containing scores from each episode
    scores_SMA100 = []
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = eps_start                    # initialize epsilon
    time_taken = []
    time_taken_window = deque(maxlen=100)
    success_rate = deque(maxlen=100)
    landing_rate = deque(maxlen=100)
    success_rate_SMA100 = []
    landing_rate_SMA100 = []
    for i_episode in tqdm(range(1, n_episodes+1)):
        state = env.reset()[0]
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
            next_state, reward, done, _, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                if score >= 200:
                    success_rate.append(1)
                    landing_rate.append(1)
                elif score >= 120:
                    success_rate.append(0)
                    landing_rate.append(1)
                else:
                    success_rate.append(0)
                    landing_rate.append(0)
                break
        scores_window.append(score)       # save most recent score
        scores_SMA100.append(np.mean(scores_window))
        scores.append(score)              # save most recent score
        time_taken.append(t)
        time_taken_window.append(t)
        landing_rate_SMA100.append(landing_rate.count(1))
        success_rate_SMA100.append(success_rate.count(1))
        eps = max(eps_end, eps_decay*eps) # decrease epsilon
        if i_episode % display_every == 0:
            # SMA100: Average of past 100 period (Simple Moving Average)
            print(f'\rEpisode {i_episode}\tAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.3f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
            torch.save(agent.qnetwork_online.state_dict(), f'.\models\{model_name+str(i_episode)}_train.pth')
        if i_episode % (display_every*2) == 0:
            save_video(agent, video_filepath, f'{model_name+str(i_episode)}_train.pth')
        elif i_episode % (display_every*2+1) == 0:
            show_video(video_filepath, 200)
        if np.mean(scores_window)>=200.0:
            print('\nEnvironment solved in {:d} episodes!'.format(i_episode))
            print(f'\rAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.3f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
            torch.save(agent.qnetwork_online.state_dict(), f'.\models\{model_name}_best.pth')
            break

    return {
        'scores': scores, 'scores_SMA100': scores_SMA100,
        'scores_window': scores_window, 'time_taken': time_taken,
        'time_taken_window': time_taken_window, 'success_rate': success_rate, 
        'landing_rate': landing_rate, 'landing_rate_SMA100': landing_rate_SMA100,
        'success_rate_SMA100': success_rate_SMA100
    }
In [17]:
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 0.0005             # learning rate 
UPDATE_EVERY = 4        # how often to update the network

Metrics Legend:

  • Avg Episode Length (SMA100): Average Episode Time Frame For The Past 100 Episodes
  • Current Score: Current Reward For The Given Episode
  • Landing Rate: Estimated % Of Agent Landing For The Past 100 Episodes (Reward >= 120)
  • Success Rate: Estimated % Of Agent Landing In The Center For The Past 100 Episodes (Reward >= 200)
In [10]:
# Reset seed - Use same seed for all experiments (objective comparison)
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)

# DQN
agent = Agent(state_dim=8, action_dim=4, hidden_dim=64, network=QNetwork)
results_DQN = train_agent(display_every=200, max_t=1000, video_filepath='DQN')
  0%|          | 0/3000 [00:00<?, ?it/s]
Episode 200	Avg Score (SMA100): -219.803 Current Score: -227
Avg Episode Length (SMA100): 147.04 Current Episode Length: 171.000
Landing Rate: 0% | Success Rate: 0%

Episode 400	Avg Score (SMA100): -135.108 Current Score: -1
Avg Episode Length (SMA100): 446.32 Current Episode Length: 243.000
Landing Rate: 0% | Success Rate: 0%

Moviepy - Building video video/DQN.mp4.
Moviepy - Writing video video/DQN.mp4

t:   0%|          | 0/1000 [00:00<?, ?it/s, now=None]
t:   7%|▋         | 69/1000 [00:00<00:01, 684.46it/s, now=None]
t:  20%|██        | 204/1000 [00:00<00:00, 1073.47it/s, now=None]
t:  34%|███▍      | 343/1000 [00:00<00:00, 1215.14it/s, now=None]
t:  48%|████▊     | 478/1000 [00:00<00:00, 1265.86it/s, now=None]
t:  61%|██████    | 610/1000 [00:00<00:00, 1285.00it/s, now=None]
t:  74%|███████▍  | 744/1000 [00:00<00:00, 1299.69it/s, now=None]
t:  88%|████████▊ | 875/1000 [00:00<00:00, 1298.12it/s, now=None]
                                                                 
Moviepy - Done !
Moviepy - video ready video/DQN.mp4
Episode 600	Avg Score (SMA100): -150.778 Current Score: -173
Avg Episode Length (SMA100): 424.03 Current Episode Length: 146.000
Landing Rate: 0% | Success Rate: 0%

Episode 800	Avg Score (SMA100): -153.087 Current Score: -64
Avg Episode Length (SMA100): 594.59 Current Episode Length: 999.000
Landing Rate: 1% | Success Rate: 0%

Moviepy - Building video video/DQN.mp4.
Moviepy - Writing video video/DQN.mp4

t:   0%|          | 0/143 [00:00<?, ?it/s, now=None]
t:  43%|████▎     | 62/143 [00:00<00:00, 616.84it/s, now=None]
                                                              
Moviepy - Done !
Moviepy - video ready video/DQN.mp4
Episode 1000	Avg Score (SMA100): -83.026 Current Score: -70
Avg Episode Length (SMA100): 868.31 Current Episode Length: 999.000
Landing Rate: 2% | Success Rate: 0%

Episode 1200	Avg Score (SMA100): -23.869 Current Score: 200
Avg Episode Length (SMA100): 759.1 Current Episode Length: 676.000
Landing Rate: 31% | Success Rate: 18%

Moviepy - Building video video/DQN.mp4.
Moviepy - Writing video video/DQN.mp4

t:   0%|          | 0/1000 [00:00<?, ?it/s, now=None]
t:   6%|▌         | 58/1000 [00:00<00:01, 576.98it/s, now=None]
t:  19%|█▉        | 190/1000 [00:00<00:00, 1010.04it/s, now=None]
t:  31%|███▏      | 313/1000 [00:00<00:00, 1110.05it/s, now=None]
t:  43%|████▎     | 434/1000 [00:00<00:00, 1145.14it/s, now=None]
t:  57%|█████▋    | 569/1000 [00:00<00:00, 1218.70it/s, now=None]
t:  71%|███████   | 708/1000 [00:00<00:00, 1273.02it/s, now=None]
t:  84%|████████▍ | 844/1000 [00:00<00:00, 1301.07it/s, now=None]
t:  98%|█████████▊| 980/1000 [00:00<00:00, 1318.95it/s, now=None]
                                                                 
Moviepy - Done !
Moviepy - video ready video/DQN.mp4
Episode 1400	Avg Score (SMA100): 15.677 Current Score: -113
Avg Episode Length (SMA100): 792.01 Current Episode Length: 999.000
Landing Rate: 77% | Success Rate: 31%

Episode 1600	Avg Score (SMA100): 36.667 Current Score: 70
Avg Episode Length (SMA100): 678.93 Current Episode Length: 999.000
Landing Rate: 69% | Success Rate: 33%

Moviepy - Building video video/DQN.mp4.
Moviepy - Writing video video/DQN.mp4

t:   0%|          | 0/344 [00:00<?, ?it/s, now=None]
t:  18%|█▊        | 63/344 [00:00<00:00, 626.76it/s, now=None]
t:  57%|█████▋    | 196/344 [00:00<00:00, 1036.40it/s, now=None]
t:  95%|█████████▌| 328/344 [00:00<00:00, 1162.81it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/DQN.mp4
Episode 1800	Avg Score (SMA100): 128.531 Current Score: -110
Avg Episode Length (SMA100): 548.12 Current Episode Length: 232.000
Landing Rate: 80% | Success Rate: 63%


Environment solved in 1951 episodes!
Avg Score (SMA100): 201.090 Current Score: 274
Avg Episode Length (SMA100): 430.74 Current Episode Length: 655.000
Landing Rate: 89% | Success Rate: 72%

Minor code ⚙️

In [38]:
def saveJSON(dict, filename):
    # Convert deque list to list
    for key, value in dict.copy().items():
        if isinstance(value, deque):
            dict[key] = list(value)
            
    # Save dictionary as .json
    with open(filename, 'w') as handle:
        json.dump(dict, handle)

def loadJSON(filename):
    # Load .json
    with open(filename, 'r') as handle:
        result = json.load(handle)
    return result

3.7 Evaluate DQN Performance🔬¶


  • Back to content table

Utility code to plot graphs ⚙️

In [39]:
# Label line graph points
def labelMaxMin(ax, history, field):
    # Find min
    if field in ['time_taken', 'time_taken_window']:
        minmax = np.min(history[field])
        legend_value = f'Min: {minmax:.2f}'
        y = minmax - 15
        
    else: # Find max
        minmax = np.max(history[field])
        legend_value = f'Max {minmax:.2f}'
        y = minmax * 1.05
        
    # Label
    epoch = np.where(history[field] == minmax)[0]
    if len(epoch) > 1:
        for elem in epoch:
            ax.plot(elem, minmax, 'ro')
        epoch = epoch[-1]
    ax.annotate(f'{minmax:.4f}', xy=(epoch, y))
    ax.plot(epoch, minmax, 'ro')
    
    # Create legend for marker
    legend_element = Line2D([0], [0], marker='o', color='r', label=legend_value)
    return legend_element

def highlightAvg(ax, history, field, idx):
    # Compute mean
    mean_value = np.mean(history[field])
    if idx == 0:
        ax.axhline(y=mean_value, linestyle='--', color='blue')
    else:
        ax.axhline(y=mean_value, linestyle='--', color='orange')
    
    # Label mean value
    epochs = len(history[field])
    text = f'{mean_value:.4f}'
    range_value = max(history[field]) - min(history[field])
    if mean_value > 100:
        ax.annotate(text, xy=(-5, mean_value + (range_value/100)))
    elif mean_value >= 0 and mean_value < 1:
        ax.annotate(text, xy=(-5, mean_value))
    elif mean_value >= 0:
        ax.annotate(text, xy=(-5, mean_value * (range_value/100)))
    else:
        ax.annotate(text, xy=(-5, mean_value / (range_value/100)))
        
    return mean_value

def makeChart(ax, history, field, idx=0):    
    ax.plot(history[field])
    # Details
    mean_value = highlightAvg(ax, history, field, idx) # Highlight mean value
    minmax_legend = labelMaxMin(ax, history, field) # Label min/max
    
    # Display legend
    if idx == 0:
        metric_legend = Line2D([0], [0], lw=2, color='blue', label=f'{field}')
        mean_legend = Line2D([0], [0], lw=2, color='blue', linestyle='dotted', label=f'Average: {mean_value:.2f}')
    else:
        metric_legend = Line2D([0], [0], lw=2, color='orange', label=f'{field}')
        mean_legend = Line2D([0], [0], lw=2, color='orange', linestyle='dotted', label=f'Average: {mean_value:.2f}')
        
    legend_elements = [metric_legend, minmax_legend, mean_legend]
    return legend_elements

# Loss and accuracy plots
def plotResult(history, fields):
    fig, ax = plt.subplots(1, 2, figsize=(18, 8))
        
    for i in range(len(fields)):            
        # Check if its a nested list e.g. ([['success_rate', 'landing_rate'], 'others']) 
        if len(fields[i]) > 1:
            # Plot chart with two fields
            legend_element = []
            for field in fields[i]:
                idx = fields[i].index(field)
                legend = makeChart(ax[i], history, field, idx)
                legend_element.append(legend)
            
            # Label legend
            flattened_list = []
            for sublist in legend_element:
                for item in sublist:
                    flattened_list.append(item)
                    
            ax[i].legend(handles=flattened_list)
            
            # Label
            if 'rate' in fields[0][0] or 'rate' in fields[i][1]:
                ax[i].set_ylabel('percentage')
            
            if len(history[field]) == 100:
                ax[i].set_xlabel('Past 100')
            else:
                ax[i].set_xlabel('Episodes')
                
            ax[i].set_title(f'{fields[0][0].capitalize()} and {fields[0][1].capitalize()}')
            
            
        else: # Plot chart with one field
            field = fields[i][0]
            legend = makeChart(ax[i], history, field)
            ax[i].legend(handles=legend)
            ax[i].set_ylabel(field)
            
            if len(history[field]) == 100:
                ax[i].set_xlabel('Past 100')
            else:
                ax[i].set_xlabel('Episodes')
            ax[i].set_title(f'{field.capitalize()}')
        
    plt.tight_layout()
    plt.show()
    
    # Final result
    flattened_list = []
    for sublist in fields:
        for item in sublist:
            flattened_list.append(item)
    print(f'Past {len(history[flattened_list[0]])} Episodes')
    print('====================================')
    for field in flattened_list:
        print(f'Final {field}: {history[field][-1]:.2f}')
In [30]:
saveJSON(results_DQN, 'dict_DQN.json')

Video example of reward > 200 (Land successfully + Land between flag + Land relatively quickly)

In [12]:
show_video('DQN')
In [15]:
# Plot result
dict_DQN = loadJSON('dict_DQN.json')
sns.set_style("whitegrid")
plotResult(dict_DQN, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 1951 Episodes
====================================
Final success_rate_SMA100: 72.00
Final landing_rate_SMA100: 89.00
Final scores_SMA100: 201.09
  • Landing Rate (Reward >= 120)
  • Success Rate (Reward >= 200)

Observation:
We observed that the agent began to learn how to land with both feets around Episode 700 and successfully landed within the flags by Episode 1000. The landing rate improved rapidly, outpacing the success rate around Episode 1200. In this context, variance refers to the difference between landing and success rates. High variance indicates that the agent has mastered landing techniques, but is not as efficient, as it may waste a significant amount of reward. For example, if the agent lands 90% of the time, but its success rate remains low (less than 200 rewards), its efficiency would be questionable. Conversely, low variance implies that the landing and success rates are closely aligned, suggesting that the agent is similar effectiveness and efficient.

We can see that after Episode 1200, the variance starts to increase before decreasing. Implying that the agent is starting to land efficiently after landing how to land effectively.

4.0 Double Deep Q-Learning Network🤓¶

Double Deep Q-Learning Network (DDQN) is a variant of the deep Q-network (DQN) algorithm. DDQN addresses the problem of overestimation of action values that can occur in standard DQN, leading to suboptimal policies.

In DQN, the action-value function is updated using the maximum estimated Q-value of the next state obtained from the target network. This can lead to overestimation of the Q-values.

In DDQN, the action-value function is updated using two networks, the local network to select the action and the target network to estimate the expected future reward for that action. This separation of action selection and value estimation reduces the potential for overestimation of Q-values.

[References: Double Deep Q Networks]


  • Back to content table

4.1 Double Deep Q-Learning Network Modelling 🤖¶

  • Note that replay buffer, epsilon greedy etc. are also used in DDQN training

  • Back to content table
In [18]:
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 0.0005             # learning rate 
UPDATE_EVERY = 4        # how often to update the network
In [57]:
class DDQN(nn.Module): 
    """
    Deep Reinforcement Learning with Double Q-Learning by Hasselt et al. (2016)
    Double Deep Q-Network Model Graph
    The neural network is a function from state space $R^{n_states}$ to action space $R^{n_actions}$
    
    """ 
    def __init__(self, n_states, n_actions, hidden_size=32): 
        super(DDQN, self).__init__()
        self.n_actions = n_actions
        self.hidden_size = hidden_size
        # hidden representation
        self.dense_layer_1 = nn.Linear(n_states, hidden_size)
        self.dense_layer_2 = nn.Linear(hidden_size, hidden_size)
        self.dense_layer_3 = nn.Linear(hidden_size, hidden_size)
        # V(s)
        self.v_layer_1 = nn.Linear(hidden_size, hidden_size)
        self.v_layer_2 = nn.Linear(hidden_size, hidden_size // 2)
        self.v_layer_3 = nn.Linear(hidden_size // 2, 1)
        # A(s, a)
        self.a_layer_1 = nn.Linear(hidden_size, hidden_size)
        self.a_layer_2 = nn.Linear(hidden_size, hidden_size // 2)
        self.a_layer_3 = nn.Linear(hidden_size // 2, n_actions)
        
    def forward(self, state):
        x = F.relu(self.dense_layer_1(state))
        x = F.relu(self.dense_layer_2(x))
        x = F.relu(self.dense_layer_3(x))
        v = F.relu(self.v_layer_1(x))
        v = F.relu(self.v_layer_2(v))
        v = self.v_layer_3(v)
        a = F.relu(self.a_layer_1(x))
        a = F.relu(self.a_layer_2(a))
        a = self.a_layer_3(a)
        
        return v + a - a.mean(dim=-1, keepdim=True).expand(-1, self.n_actions)

4.2 Double Deep Q-Learning Network Training🤖¶

Metrics Legend:

  • Avg Episode Length (SMA100): Average Episode Time Frame For The Past 100 Episodes
  • Current Score: Current Reward For The Given Episode
  • Landing Rate: Estimated % Of Agent Landing For The Past 100 Episodes (Reward >= 120)
  • Success Rate: Estimated % Of Agent Landing In The Center For The Past 100 Episodes (Reward >= 200)

  • Back to content table
In [11]:
# Reset seed - Use same seed for all experiments (objective comparison)
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)

# DDQN
agent = Agent(state_dim=8, action_dim=4, hidden_dim=64, network=DDQN)
results_DDQN = train_agent(display_every=200, max_t=1000, video_filepath='DDQN')
  0%|          | 0/3000 [00:00<?, ?it/s]
Episode 200	Avg Score (SMA100): -282.159 Current Score: -202
Avg Episode Length (SMA100): 122.19 Current Episode Length: 191.000
Landing Rate: 0% | Success Rate: 0%

Episode 400	Avg Score (SMA100): -96.351 Current Score: 182
Avg Episode Length (SMA100): 395.26 Current Episode Length: 753.000
Landing Rate: 2% | Success Rate: 0%

Moviepy - Building video video/DDQN.mp4.
Moviepy - Writing video video/DDQN.mp4

t:   0%|          | 0/541 [00:00<?, ?it/s, now=None]
t:  13%|█▎        | 70/541 [00:00<00:00, 694.46it/s, now=None]
t:  38%|███▊      | 205/541 [00:00<00:00, 1077.03it/s, now=None]
t:  63%|██████▎   | 342/541 [00:00<00:00, 1208.82it/s, now=None]
t:  89%|████████▊ | 480/541 [00:00<00:00, 1272.71it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/DDQN.mp4
Episode 600	Avg Score (SMA100): -142.619 Current Score: -159
Avg Episode Length (SMA100): 403.16 Current Episode Length: 253.000
Landing Rate: 4% | Success Rate: 1%

Episode 800	Avg Score (SMA100): -143.774 Current Score: -269
Avg Episode Length (SMA100): 445.92 Current Episode Length: 800.000
Landing Rate: 4% | Success Rate: 0%

Moviepy - Building video video/DDQN.mp4.
Moviepy - Writing video video/DDQN.mp4

t:   0%|          | 0/952 [00:00<?, ?it/s, now=None]
t:   7%|▋         | 66/952 [00:00<00:01, 655.88it/s, now=None]
t:  21%|██        | 199/952 [00:00<00:00, 1046.95it/s, now=None]
t:  35%|███▍      | 331/952 [00:00<00:00, 1163.34it/s, now=None]
t:  49%|████▉     | 470/952 [00:00<00:00, 1250.04it/s, now=None]
t:  63%|██████▎   | 601/952 [00:00<00:00, 1268.47it/s, now=None]
t:  77%|███████▋  | 732/952 [00:00<00:00, 1282.38it/s, now=None]
t:  91%|█████████▏| 869/952 [00:00<00:00, 1307.53it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/DDQN.mp4
Episode 1000	Avg Score (SMA100): -175.103 Current Score: -205
Avg Episode Length (SMA100): 484.74 Current Episode Length: 519.000
Landing Rate: 0% | Success Rate: 0%

Episode 1200	Avg Score (SMA100): -92.123 Current Score: -170
Avg Episode Length (SMA100): 594.58 Current Episode Length: 682.000
Landing Rate: 17% | Success Rate: 3%

Moviepy - Building video video/DDQN.mp4.
Moviepy - Writing video video/DDQN.mp4

t:   0%|          | 0/1000 [00:00<?, ?it/s, now=None]
t:   6%|▌         | 59/1000 [00:00<00:01, 589.77it/s, now=None]
t:  19%|█▊        | 186/1000 [00:00<00:00, 985.12it/s, now=None]
t:  31%|███       | 311/1000 [00:00<00:00, 1103.29it/s, now=None]
t:  44%|████▍     | 444/1000 [00:00<00:00, 1188.77it/s, now=None]
t:  56%|█████▋    | 563/1000 [00:00<00:00, 1186.31it/s, now=None]
t:  69%|██████▉   | 689/1000 [00:00<00:00, 1208.35it/s, now=None]
t:  82%|████████▏ | 815/1000 [00:00<00:00, 1221.20it/s, now=None]
t:  95%|█████████▍| 946/1000 [00:00<00:00, 1247.59it/s, now=None]
                                                                 
Moviepy - Done !
Moviepy - video ready video/DDQN.mp4
Episode 1400	Avg Score (SMA100): 93.774 Current Score: -15
Avg Episode Length (SMA100): 614.15 Current Episode Length: 91.000
Landing Rate: 65% | Success Rate: 30%

Episode 1600	Avg Score (SMA100): 79.704 Current Score: -105
Avg Episode Length (SMA100): 551.25 Current Episode Length: 706.000
Landing Rate: 56% | Success Rate: 30%

Moviepy - Building video video/DDQN.mp4.
Moviepy - Writing video video/DDQN.mp4

t:   0%|          | 0/318 [00:00<?, ?it/s, now=None]
t:  16%|█▋        | 52/318 [00:00<00:00, 515.84it/s, now=None]
t:  55%|█████▌    | 176/318 [00:00<00:00, 936.30it/s, now=None]
t:  98%|█████████▊| 313/318 [00:00<00:00, 1128.92it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/DDQN.mp4
Episode 1800	Avg Score (SMA100): 42.873 Current Score: -23
Avg Episode Length (SMA100): 538.49 Current Episode Length: 999.000
Landing Rate: 46% | Success Rate: 21%

Episode 2000	Avg Score (SMA100): 101.315 Current Score: 268
Avg Episode Length (SMA100): 535.4 Current Episode Length: 204.000
Landing Rate: 65% | Success Rate: 46%

Moviepy - Building video video/DDQN.mp4.
Moviepy - Writing video video/DDQN.mp4

t:   0%|          | 0/878 [00:00<?, ?it/s, now=None]
t:   6%|▌         | 52/878 [00:00<00:01, 515.96it/s, now=None]
t:  20%|█▉        | 173/878 [00:00<00:00, 920.27it/s, now=None]
t:  34%|███▍      | 298/878 [00:00<00:00, 1068.02it/s, now=None]
t:  48%|████▊     | 422/878 [00:00<00:00, 1135.10it/s, now=None]
t:  62%|██████▏   | 548/878 [00:00<00:00, 1154.23it/s, now=None]
t:  77%|███████▋  | 676/878 [00:00<00:00, 1192.36it/s, now=None]
t:  92%|█████████▏| 805/878 [00:00<00:00, 1223.43it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/DDQN.mp4
Episode 2200	Avg Score (SMA100): 167.124 Current Score: 239
Avg Episode Length (SMA100): 375.89 Current Episode Length: 379.000
Landing Rate: 77% | Success Rate: 61%


Environment solved in 2312 episodes!
Avg Score (SMA100): 201.737 Current Score: 237
Avg Episode Length (SMA100): 336.03 Current Episode Length: 223.000
Landing Rate: 86% | Success Rate: 71%

In [14]:
saveJSON(results_DDQN, 'results_DDQN.json')

4.3 Evaluate DDQN Performance🔬¶


  • Back to content table
In [20]:
show_video('DDQN')
In [21]:
results_DDQN = loadJSON('results_DDQN.json')
In [22]:
# Plot result
results_DDQN = loadJSON('results_DDQN.json')
sns.set_style("whitegrid")
plotResult(results_DDQN, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 2312 Episodes
====================================
Final success_rate_SMA100: 71.00
Final landing_rate_SMA100: 86.00
Final scores_SMA100: 201.74

Observation:
We observed that the agent began to learn how to land with both feets around Episode 400 and successfully landed within the flags by Episode 420. The landing rate improved rapidly, outpacing the success rate around Episode 1300.

Compared to DQN plots, the training landing rate and success rate of the agent may appear unstable. However, a closer inspection reveals that the agent may be learning to land more efficiently, as demonstrated by the decrease in landing rate in Episode 1300, where the success rate decline is not as significant. This suggests that the agent is focusing on landing more efficiently rather than just effectively.

5.0 Actor and Critic | Proximal Policy Optimization (PPO)🤓¶

Actor-Critic is a popular reinforcement learning (RL) algorithm that combines both value-based and policy-based methods. The Actor refers to the policy network that maps the current state of the environment to an action, while the Critic is a value network that estimates the expected reward of a given state-action pair.

Proximal Policy Optimization (PPO) is an algorithm that can be used to improve the Actor-Critic algorithm by controlling the step size between consecutive policies in the optimization process. In traditional policy gradient algorithms, there is a risk of updating the policy too much, leading to a destabilization of the learning process. PPO addresses this issue by using a surrogate objective function that limits the step size between consecutive policies, allowing for more stable and efficient learning.

Symbol Meaning
$\pi_{\theta}$ The policy represented by the parameter vector $\theta$
$s_t$ The state at time t
$A_t$ The action taken at time t
$R_t$ The reward at time t
$\gamma$ The discount factor, which determines the importance of future rewards relative to immediate rewards (0 < $\gamma$ $\leq$ 1)
$J(\theta)$ The objective function to be optimized in PPO

PPO is a reinforcement learning algorithm that improves upon the traditional policy gradient methods. PPO aims to stabilize the policy update process and avoid oscillations that can occur with traditional policy gradient methods.

The objective function in PPO is given by:

$J(\theta) = \mathbb{E}_{t}[\text{min}(r_t(\theta)\cdot A_t,\text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\cdot A_t)]$

where:

  • $A_t$ is the advantage function defined as $A_t = Q_t - V_t(s_t)$, where $Q_t$ is the estimated cumulative reward and $V_t(s_t)$ is the value function that estimates the expected cumulative reward starting from state $s_t$ and following policy $\pi_{\theta}$.
  • $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the ratio of the current policy and the old policy.
  • $\text{clip}(x,a,b)$ is the clipping function that returns $x$ if $x \in [a,b]$ and returns $a$ if $x<a$ and returns $b$ if $x>b$.
  • $\epsilon$ is a hyperparameter that determines the magnitude of the clipping function.

The objective function is a combination of the surrogate objective, $r_t(\theta)\cdot A_t$, and the clipping function, $\text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\cdot A_t$. The surrogate objective encourages the improvement of the current policy, while the clipping function acts as a constraint that limits the magnitude of the policy update and helps stabilize the learning process.

PPO is an improvement over the traditional Q-function approach because it provides a more stable and effective way to update the policy, reducing the risk of oscillation and divergence. Additionally, PPO is computationally efficient and easier to implement compared to other reinforcement learning algorithms, making it a popular choice for real-world applications.

[References: Proximal Policy Optimization] [References: The Actor Critic Reinforcement Learning Algorithm]


  • Back to content table

5.1 Actor and Critic Network Modelling🤖¶


  • Back to content table
In [8]:
class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout = 0.1):
        super().__init__()
        
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.Dropout(dropout),
            # PReLU -> Variant of LeakyReLU
            nn.PReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Dropout(dropout),
            nn.PReLU(),
            nn.Linear(hidden_dim, output_dim)
        )
        
    def forward(self, x):
        x = self.net(x)
        return x
In [9]:
class ActorCritic(nn.Module):
    def __init__(self, actor, critic):
        super().__init__()
        
        self.actor = actor
        self.critic = critic
        
    def forward(self, state):
        action_pred = self.actor(state)
        value_pred = self.critic(state)
        
        return action_pred, value_pred
In [10]:
def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_normal_(m.weight)
        m.bias.data.fill_(0)

5.1.1 PPO returns🤓🤖¶

Returns are used to evaluate the quality of a policy and to provide a signal for updating the policy network. The return is the discounted sum of future rewards, and it provides information about how well the policy is performing. The policy network is updated to maximize the expected return, which is the sum of future rewards expected under the current policy. The optimization process adjusts the parameters of the policy network so that it predicts higher probabilities for actions that lead to higher returns.


  • Back to content table
In [11]:
def calculate_returns(rewards, discount_factor, normalize = True):
    
    returns = []
    R = 0
    
    for r in reversed(rewards):
        R = r + R * discount_factor
        returns.insert(0, R)
        
    returns = torch.tensor(returns)
    
    if normalize:
        returns = (returns - returns.mean()) / returns.std()
        
    return returns

5.1.2 PPO advantages🤓🤖¶

PPO advantages are used to adjust the probability of taking a particular action in a state. The policy network outputs a probability distribution over actions, and the advantages are used to adjust this distribution to favor actions that lead to higher rewards.


  • Back to content table
In [12]:
def calculate_advantages(returns, values, normalize = True):
    
    advantages = returns - values
    
    if normalize:
        
        advantages = (advantages - advantages.mean()) / advantages.std()
        
    return advantages

By using both advantages and returns, PPO balances the trade-off between exploration and exploitation. The policy network explores the environment by trying new actions, and it exploits the knowledge gained from past experiences by favoring actions that lead to higher rewards. Over time, the policy network learns to take actions that lead to higher returns, leading to an improvement in the overall quality of the policy. This technique is used to replace the traditional epsilon greedy method.

5.2.0 Utility Code⚙️¶

Update agent code

In [13]:
def update_policy(policy, states, actions, log_prob_actions, advantages, returns, optimizer, ppo_steps, ppo_clip):
    
    states = states.detach()
    actions = actions.detach()
    log_prob_actions = log_prob_actions.detach()
    advantages = advantages.detach()
    returns = returns.detach()
    
    for _ in range(ppo_steps):
                
        #get new log prob of actions for all input states
        action_pred, value_pred = policy(states)
        value_pred = value_pred.squeeze(-1)
        action_prob = F.softmax(action_pred, dim = -1)
        dist = distributions.Categorical(action_prob)
        
        #new log prob using old actions
        new_log_prob_actions = dist.log_prob(actions)
        
        policy_ratio = (new_log_prob_actions - log_prob_actions).exp()
                
        policy_loss_1 = policy_ratio * advantages
        policy_loss_2 = torch.clamp(policy_ratio, min = 1.0 - ppo_clip, max = 1.0 + ppo_clip) * advantages
        
        policy_loss = - torch.min(policy_loss_1, policy_loss_2).mean().to(device)
        
        value_loss = F.smooth_l1_loss(returns, value_pred).mean().to(device)
    
        optimizer.zero_grad()

        policy_loss.backward()
        value_loss.backward()

        optimizer.step()

5.2.1 Save Video (PPO) ⚙️¶

In [14]:
def save_video_PPO(policy, file_name, model_ckpt= 'checkpoint_best.pth',render_mode="rgb_array", max_t = 1000, seed = 0):
    env = gym.make('LunarLander-v2',enable_wind=True,render_mode="rgb_array")
    vid = video_recorder.VideoRecorder(env, path="video/{}.mp4".format(file_name))
    policy.load_state_dict(torch.load('./models/' + model_ckpt))
    state = env.reset(seed = seed)[0]
    done = False
    t = 0
    rewards = 0
    while not done and t != max_t:
        t += 1
        frame = env.render()
        vid.capture_frame()
        
        state = torch.FloatTensor(state).unsqueeze(0)
        action_pred, _ = policy(state)
        action_prob = F.softmax(action_pred, dim = -1)
        dist = distributions.Categorical(action_prob)
        action = dist.sample()

        state, reward, done, _, _ = env.step(action.item())
        rewards += reward
    env.close()
    return rewards

5.3 Training Algorithm Code (PPO)⚙️¶


  • Back to content table
In [15]:
def train_policy(env,policy, optimizer, 
                 discount_factor=0.99, ppo_steps=5, ppo_clip=0.2,
                 n_episodes=1000, max_t=1000, model_name='PPO_ActorCritic',
                 display_every=100):
    np.random.seed(0)
    # Put model to train
    policy.train()
    # Metris variables
    scores = []                        # list containing scores from each episode
    scores_SMA100 = []
    scores_window = deque(maxlen=100)  # last 100 scores
    time_taken = []
    time_taken_window = deque(maxlen=100)
    success_rate = deque(maxlen=100)
    landing_rate = deque(maxlen=100)
    success_rate_SMA100 = []
    landing_rate_SMA100 = []
    for i_episode in tqdm(range(1, n_episodes+1)):
        score = 0
        state = env.reset()[0]
        
        # Policy variables
        states = []
        actions = []
        log_prob_actions = []
        values = []
        rewards = []

        for t in range(max_t):
            
            state = torch.FloatTensor(state).unsqueeze(0)

            #append state here, not after we get the next state from env.step()
            states.append(state)
            action_pred, value_pred = policy(state)
            action_prob = F.softmax(action_pred, dim = -1)
            dist = distributions.Categorical(action_prob)
            action = dist.sample()
            log_prob_action = dist.log_prob(action)
            state, reward, done, _, _ = env.step(action.item())

            actions.append(action)
            log_prob_actions.append(log_prob_action)
            values.append(value_pred)
            rewards.append(reward)

            score += reward
            if done:
                if score >= 200:
                    success_rate.append(1)
                    landing_rate.append(1)
                elif score >= 120:
                    success_rate.append(0)
                    landing_rate.append(1)
                else:
                    success_rate.append(0)
                    landing_rate.append(0)
                break
        ### Record Metrics ###
        scores_window.append(score)       # save most recent score
        scores_SMA100.append(np.mean(scores_window))
        scores.append(score)              # save most recent score
        time_taken.append(t)
        time_taken_window.append(t)
        landing_rate_SMA100.append(landing_rate.count(1))
        success_rate_SMA100.append(success_rate.count(1))

        states = torch.cat(states)
        actions = torch.cat(actions)    
        log_prob_actions = torch.cat(log_prob_actions)
        values = torch.cat(values).squeeze(-1)
        returns = calculate_returns(rewards, discount_factor)
        advantages = calculate_advantages(returns, values)
        update_policy(policy, states, actions, log_prob_actions, advantages, returns, optimizer, ppo_steps, ppo_clip)
        
        if i_episode % display_every == 0:
            # SMA100: Average of past 100 period (Simple Moving Average)
            print(f'\rEpisode {i_episode}\tAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.0f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
            torch.save(policy.state_dict(), f'.\models\{model_name+str(i_episode)}_train.pth')
        if i_episode % (display_every*2) == 0:
            save_video_PPO(policy, 'PPO', f'{model_name+str(i_episode)}_train.pth',max_t=max_t)
        elif i_episode % (display_every*2+1) == 0:
            show_video('PPO', 200)
        if np.mean(scores_window)>=200.0:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
            torch.save(policy.state_dict(), f'.\models\{model_name}_best.pth')
            break
    return {
        'scores': scores, 'scores_SMA100': scores_SMA100,
        'scores_window': scores_window, 'time_taken': time_taken,
        'time_taken_window': time_taken_window, 'success_rate': success_rate, 
        'landing_rate': landing_rate, 'landing_rate_SMA100': landing_rate_SMA100,
        'success_rate_SMA100': success_rate_SMA100
    }

5.4 Training /w Actor & Critic (PPO)🤖¶

Metrics Legend:

  • Avg Episode Length (SMA100): Average Episode Time Frame For The Past 100 Episodes
  • Current Score: Current Reward For The Given Episode
  • Landing Rate: Estimated % Of Agent Landing For The Past 100 Episodes (Reward >= 120)
  • Success Rate: Estimated % Of Agent Landing In The Center For The Past 100 Episodes (Reward >= 200)

  • Back to content table
In [244]:
actor = MLP(8, 128, 4).to(device)
critic = MLP(8, 128, 1).to(device)
policy = ActorCritic(actor, critic)
policy.apply(init_weights)
optimizer = optim.Adam(policy.parameters(), lr = LR)

np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)

results_PPO = train_policy(env, policy, optimizer, 0.99, 5, 0.2, 2000, 1000,'PPO_Actor_Critic',300)
  0%|          | 0/2000 [00:00<?, ?it/s]
Episode 300	Avg Score (SMA100): -114.343 Current Score: -231
Avg Episode Length (SMA100): 409.63 Current Episode Length: 430
Landing Rate: 6% | Success Rate: 0%

Episode 600	Avg Score (SMA100): -3.890 Current Score: 33
Avg Episode Length (SMA100): 626.34 Current Episode Length: 193
Landing Rate: 5% | Success Rate: 1%

Moviepy - Building video video/PPO.mp4.
Moviepy - Writing video video/PPO.mp4

t:   0%|          | 0/257 [00:00<?, ?it/s, now=None]
t:  20%|█▉        | 51/257 [00:00<00:00, 505.48it/s, now=None]
t:  68%|██████▊   | 174/257 [00:00<00:00, 925.39it/s, now=None]
                                                               
Moviepy - Done !
Moviepy - video ready video/PPO.mp4
Episode 900	Avg Score (SMA100): 40.169 Current Score: -14
Avg Episode Length (SMA100): 773.39 Current Episode Length: 191
Landing Rate: 24% | Success Rate: 6%

Episode 1200	Avg Score (SMA100): 17.723 Current Score: -48
Avg Episode Length (SMA100): 737.13 Current Episode Length: 628
Landing Rate: 44% | Success Rate: 7%

Moviepy - Building video video/PPO.mp4.
Moviepy - Writing video video/PPO.mp4

t:   0%|          | 0/1000 [00:00<?, ?it/s, now=None]
t:   6%|▌         | 59/1000 [00:00<00:01, 586.28it/s, now=None]
t:  19%|█▉        | 190/1000 [00:00<00:00, 1008.61it/s, now=None]
t:  32%|███▏      | 320/1000 [00:00<00:00, 1140.16it/s, now=None]
t:  46%|████▌     | 455/1000 [00:00<00:00, 1221.67it/s, now=None]
t:  59%|█████▉    | 590/1000 [00:00<00:00, 1265.10it/s, now=None]
t:  73%|███████▎  | 728/1000 [00:00<00:00, 1300.41it/s, now=None]
t:  86%|████████▌ | 862/1000 [00:00<00:00, 1311.14it/s, now=None]
t: 100%|█████████▉| 999/1000 [00:00<00:00, 1328.23it/s, now=None]
                                                                 
Moviepy - Done !
Moviepy - video ready video/PPO.mp4
Episode 1500	Avg Score (SMA100): 29.085 Current Score: 114
Avg Episode Length (SMA100): 615.67 Current Episode Length: 999
Landing Rate: 39% | Success Rate: 6%

Episode 1800	Avg Score (SMA100): 48.599 Current Score: 55
Avg Episode Length (SMA100): 812.39 Current Episode Length: 999
Landing Rate: 32% | Success Rate: 3%

Moviepy - Building video video/PPO.mp4.
Moviepy - Writing video video/PPO.mp4

t:   0%|          | 0/720 [00:00<?, ?it/s, now=None]
t:   3%|▎         | 25/720 [00:00<00:02, 246.27it/s, now=None]
t:  12%|█▏        | 86/720 [00:00<00:01, 457.53it/s, now=None]
t:  30%|██▉       | 215/720 [00:00<00:00, 834.97it/s, now=None]
t:  48%|████▊     | 345/720 [00:00<00:00, 1015.84it/s, now=None]
t:  66%|██████▋   | 477/720 [00:00<00:00, 1123.03it/s, now=None]
t:  84%|████████▍ | 605/720 [00:00<00:00, 1175.78it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/PPO.mp4
In [364]:
actor = MLP(8, 128, 4).to(device)
critic = MLP(8, 128, 1).to(device)
policy = ActorCritic(actor, critic)
policy.apply(init_weights)
optimizer = optim.Adam(policy.parameters(), lr = LR)

np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)

results_PPO = train_policy(env, policy, optimizer, 0.99, 5, 0.2, 2000, 1000,'PPO_Actor_Critic',150)
  0%|          | 0/2000 [00:00<?, ?it/s]
Episode 150	Avg Score (SMA100): -276.004 Current Score: -35
Avg Episode Length (SMA100): 103.54 Current Episode Length: 91
Landing Rate: 0% | Success Rate: 0%

Episode 300	Avg Score (SMA100): -170.954 Current Score: -295
Avg Episode Length (SMA100): 412.22 Current Episode Length: 481
Landing Rate: 0% | Success Rate: 0%

Moviepy - Building video video/PPO.mp4.
Moviepy - Writing video video/PPO.mp4

t:   0%|          | 0/203 [00:00<?, ?it/s, now=None]
t:  26%|██▌       | 53/203 [00:00<00:00, 520.22it/s, now=None]
t:  82%|████████▏ | 167/203 [00:00<00:00, 880.86it/s, now=None]
                                                               
Moviepy - Done !
Moviepy - video ready video/PPO.mp4
Episode 450	Avg Score (SMA100): -67.057 Current Score: -3
Avg Episode Length (SMA100): 645.19 Current Episode Length: 300
Landing Rate: 2% | Success Rate: 0%

Episode 600	Avg Score (SMA100): 14.330 Current Score: 91
Avg Episode Length (SMA100): 733.44 Current Episode Length: 999
Landing Rate: 10% | Success Rate: 2%

Moviepy - Building video video/PPO.mp4.
Moviepy - Writing video video/PPO.mp4

t:   0%|          | 0/324 [00:00<?, ?it/s, now=None]
t:  19%|█▊        | 60/324 [00:00<00:00, 596.32it/s, now=None]
t:  55%|█████▌    | 179/324 [00:00<00:00, 937.27it/s, now=None]
t:  90%|█████████ | 293/324 [00:00<00:00, 1025.41it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/PPO.mp4
Episode 750	Avg Score (SMA100): 51.255 Current Score: 98
Avg Episode Length (SMA100): 633.99 Current Episode Length: 999
Landing Rate: 23% | Success Rate: 4%

Episode 900	Avg Score (SMA100): 62.007 Current Score: 191
Avg Episode Length (SMA100): 729.35 Current Episode Length: 291
Landing Rate: 35% | Success Rate: 8%

Moviepy - Building video video/PPO.mp4.
Moviepy - Writing video video/PPO.mp4

t:   0%|          | 0/392 [00:00<?, ?it/s, now=None]
t:  16%|█▌        | 62/392 [00:00<00:00, 611.80it/s, now=None]
t:  47%|████▋     | 185/392 [00:00<00:00, 969.28it/s, now=None]
t:  78%|███████▊  | 306/392 [00:00<00:00, 1074.98it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/PPO.mp4
Episode 1050	Avg Score (SMA100): -13.426 Current Score: 125
Avg Episode Length (SMA100): 660.98 Current Episode Length: 591
Landing Rate: 30% | Success Rate: 4%

Episode 1200	Avg Score (SMA100): -52.430 Current Score: -116
Avg Episode Length (SMA100): 626.09 Current Episode Length: 611
Landing Rate: 21% | Success Rate: 2%

Moviepy - Building video video/PPO.mp4.
Moviepy - Writing video video/PPO.mp4

t:   0%|          | 0/755 [00:00<?, ?it/s, now=None]
t:   9%|▉         | 69/755 [00:00<00:01, 684.86it/s, now=None]
t:  27%|██▋       | 202/755 [00:00<00:00, 1058.32it/s, now=None]
t:  45%|████▍     | 336/755 [00:00<00:00, 1184.66it/s, now=None]
t:  62%|██████▏   | 467/755 [00:00<00:00, 1233.71it/s, now=None]
t:  79%|███████▉  | 598/755 [00:00<00:00, 1259.08it/s, now=None]
t:  97%|█████████▋| 729/755 [00:00<00:00, 1274.51it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/PPO.mp4
Episode 1350	Avg Score (SMA100): -101.266 Current Score: -180
Avg Episode Length (SMA100): 644.47 Current Episode Length: 115
Landing Rate: 14% | Success Rate: 1%

Episode 1500	Avg Score (SMA100): -25.547 Current Score: -177
Avg Episode Length (SMA100): 604.42 Current Episode Length: 706
Landing Rate: 18% | Success Rate: 3%

Moviepy - Building video video/PPO.mp4.
Moviepy - Writing video video/PPO.mp4

t:   0%|          | 0/988 [00:00<?, ?it/s, now=None]
t:   5%|▌         | 51/988 [00:00<00:01, 506.85it/s, now=None]
t:  18%|█▊        | 175/988 [00:00<00:00, 935.35it/s, now=None]
t:  30%|███       | 297/988 [00:00<00:00, 1064.84it/s, now=None]
t:  43%|████▎     | 423/988 [00:00<00:00, 1138.81it/s, now=None]
t:  56%|█████▌    | 549/988 [00:00<00:00, 1178.84it/s, now=None]
t:  68%|██████▊   | 676/988 [00:00<00:00, 1207.90it/s, now=None]
t:  82%|████████▏ | 807/988 [00:00<00:00, 1238.25it/s, now=None]
t:  95%|█████████▌| 940/988 [00:00<00:00, 1266.48it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/PPO.mp4
Episode 1650	Avg Score (SMA100): -278.113 Current Score: -329
Avg Episode Length (SMA100): 683.42 Current Episode Length: 178
Landing Rate: 1% | Success Rate: 0%

Episode 1800	Avg Score (SMA100): -134.757 Current Score: -44
Avg Episode Length (SMA100): 758.98 Current Episode Length: 999
Landing Rate: 3% | Success Rate: 0%

Moviepy - Building video video/PPO.mp4.
Moviepy - Writing video video/PPO.mp4

t:   0%|          | 0/1000 [00:00<?, ?it/s, now=None]
t:   6%|▌         | 58/1000 [00:00<00:01, 576.27it/s, now=None]
t:  20%|█▉        | 196/1000 [00:00<00:00, 1046.21it/s, now=None]
t:  33%|███▎      | 330/1000 [00:00<00:00, 1178.33it/s, now=None]
t:  45%|████▌     | 454/1000 [00:00<00:00, 1200.67it/s, now=None]
t:  59%|█████▉    | 590/1000 [00:00<00:00, 1252.87it/s, now=None]
t:  72%|███████▏  | 723/1000 [00:00<00:00, 1278.88it/s, now=None]
t:  86%|████████▌ | 860/1000 [00:00<00:00, 1305.59it/s, now=None]
                                                                 
Moviepy - Done !
Moviepy - video ready video/PPO.mp4
Episode 1950	Avg Score (SMA100): 78.886 Current Score: 78
Avg Episode Length (SMA100): 789.49 Current Episode Length: 999
Landing Rate: 38% | Success Rate: 5%

In [ ]:
saveJSON(results_PPO, 'results_PPO.json')

5.5 Evaluate Actor & Critic /w PPO Performance 🔬¶


  • Back to content table
In [368]:
show_video('PPO')
In [367]:
# Plot result
results_PPO = loadJSON('results_PPO.json')
sns.set_style("whitegrid")
plotResult(results_PPO, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 2000 Episodes
====================================
Final success_rate_SMA100: 5.00
Final landing_rate_SMA100: 51.00
Final scores_SMA100: 75.81

Observation:
We observed that the agent began to learn how to land with both feets around Episode 400 and successfully landed within the flags by Episode 420. The landing rate improved rapidly, outpacing the success rate around Episode 500.

Compared to other plots, the training landing rate and success rate of the agent appear to be unstable, with a decline starting from Episode 850. This trend continues until Episode 1750, which might be attributed to the nature of PPO clipping. These results indicate that the Actor & Critic PPO may not be the best choice, particularly in an environment with only 4 actions.

6.0 DQN + Prioritized Experience Replay - Improving Our Best Candidate¶

  • DQN as candidate model, as it is able to solve within the shortest epoch

Prioritized Experience Replay (PER) is a modification to the traditional Experience Replay (ER) algorithm in Reinforcement Learning. The main difference between the two is the way experiences are stored and sampled from the replay buffer.

In ER, experiences are stored in a fixed-size buffer and randomly sampled from this buffer to train the agent. This can result in inefficient use of the experiences, as the agent may repeatedly sample low-impact experiences, while high-impact experiences are neglected.

In PER, experiences are assigned a priority value based on their estimated impact on the agent's learning. High-impact experiences are assigned a higher priority, and are therefore more likely to be sampled and used to update the agent's policy. This leads to more efficient use of experiences, as the agent focuses on learning from the most impactful experiences.

PER can result in improved performance compared to ER, as the agent is able to learn more effectively from the experiences that are most valuable to its learning process.


Resources: PER Original Implementation, 2015


  • Back to content table

6.1 Adding PER into our DQN Agent 🤖¶


  • Back to content table

Code for PER

In [112]:
class PriortizationReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, state_dim, action_dim, buffer_size, batch_size, priority=False):
        """Initialize a ReplayBuffer object.
        Params
        ======
            action_dim (int): dimension of each action
            buffer_dim (int): maximum size of buffer (chosen as multiple of num agents)
            batch_size (int): size of each training batch
            seed (int): random seed
        """
        self.states = torch.zeros((buffer_size,)+(state_dim,)).to(device)
        self.next_states = torch.zeros((buffer_size,)+(state_dim,)).to(device)
        self.actions = torch.zeros(buffer_size,1, dtype=torch.long).to(device)
        self.rewards = torch.zeros(buffer_size, 1, dtype=torch.float).to(device)
        self.dones = torch.zeros(buffer_size, 1, dtype=torch.float).to(device)
        self.e = np.zeros((buffer_size, 1), dtype=np.float32)
        
        self.priority = priority

        self.ptr = 0
        self.n = 0
        self.buffer_size = buffer_size
        self.batch_size = batch_size
    
    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        self.states[self.ptr] = torch.from_numpy(state).to(device)
        self.next_states[self.ptr] = torch.from_numpy(next_state).to(device)
        
        self.actions[self.ptr] = torch.from_numpy(np.asarray(action)).to(device)
        self.rewards[self.ptr] = torch.from_numpy(np.asarray(reward)).to(device)
        self.dones[self.ptr] = done
        
        self.ptr += 1
        if self.ptr >= self.buffer_size:
            self.ptr = 0
            self.n = self.buffer_size

    def sample(self, get_all=False):
        """Randomly sample a batch of experiences from memory."""
        n = len(self)
        if get_all:
            return self.states[:n], self.actions[:n], self.rewards[:n], self.next_states[:n], self.dones[:n]
        # else:
        if self.priority:
            idx = np.random.choice(n, self.batch_size, replace=False, p=self.e)
        else:
            idx = np.random.choice(n, self.batch_size, replace=False)
        
        states = self.states[idx]
        next_states = self.next_states[idx]
        actions = self.actions[idx]
        rewards = self.rewards[idx]
        dones = self.dones[idx]
        
        return (states, actions, rewards, next_states, dones), idx
      
    def update_error(self, e, idx=None):
        e = torch.abs(e.detach())
        e = e / e.sum()
        if idx is not None:
            self.e[idx] = e.cpu().numpy()
        else:
            self.e[:len(self)] = e.cpu().numpy()
        
    def __len__(self):
        if self.n == 0:
            return self.ptr
        else:
            return self.n

Edited Agent to incorporate PER

In [101]:
class PTRAgent:
    """Interacts with and learns from the environment."""

    def __init__(self, state_size, action_size, hidden_dim,network, LR, weight_decay,priority=True):
        """Initialize an Agent object.
        
        Params
        ======
            state_size (int): dimension of each state
            action_size (int): dimension of each action
            seed (int): random seed
        """
        self.state_size = state_size
        self.action_size = action_size

        self.qnetwork_online = network(state_size, action_size, hidden_dim).to(device)
        self.qnetwork_target = network(state_size, action_size, hidden_dim).to(device)
        
        self.optimizer = optim.Adam(self.qnetwork_online.parameters(), lr=LR, weight_decay=weight_decay)

        # Replay memory
        self.memory = PriortizationReplayBuffer(state_size, (action_size,), BUFFER_SIZE, BATCH_SIZE)
        # Initialize time step (for updating every UPDATE_EVERY steps)
        self.t_step = 0
    
    def step(self, state, action, reward, next_state, done):
        # Save experience in replay memory
        self.memory.add(state, action, reward, next_state, done)
        
        # Learn every UPDATE_EVERY time steps.
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0:
            # If enough samples are available in memory, get random subset and learn
            if len(self.memory) > BATCH_SIZE:
                experiences, idx = self.memory.sample()
                e = self.learn(experiences)
                self.memory.update_error(e, idx)

    def act(self, state, eps=0.):
        """Returns actions for given state as per current policy.
        
        Params
        ======
            state (array_like): current state
            eps (float): epsilon, for epsilon-greedy action selection
        """
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.qnetwork_online.eval()
        with torch.no_grad():
            action_values = self.qnetwork_online(state)
        self.qnetwork_online.train()

        # Epsilon-greedy action selection
        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))
          
    def update_error(self):
        states, actions, rewards, next_states, dones = self.memory.sample(get_all=True)
        with torch.no_grad():
            maxQ = self.qnetwork_target(next_states).max(-1, keepdim=True)[0]
            target = rewards+GAMMA*maxQ*(1-dones)
            old_val = self.qnetwork_online(states).gather(-1, actions)
            e = old_val - target
            self.memory.update_error(e)

    def learn(self, experiences):
        """Update value parameters using given batch of experience tuples.
        Params
        ======
            experiences (Tuple[torch.Variable]): tuple of (s, a, r, s', done) tuples 
            gamma (float): discount factor
        """
        states, actions, rewards, next_states, dones = experiences

        ## compute and minimize the loss
        self.optimizer.zero_grad()

        with torch.no_grad():
            maxQ = self.qnetwork_target(next_states).max(-1, keepdim=True)[0]
            target = rewards+GAMMA*maxQ*(1-dones)
        old_val = self.qnetwork_online(states).gather(-1, actions)   
        
        loss = F.mse_loss(old_val, target)
        loss.backward()
        self.optimizer.step()

        # ------------------- update target network ------------------- #
        self.soft_update(self.qnetwork_online, self.qnetwork_target, TAU) 
        
        return old_val - target


    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target
        Params
        ======
            local_model (PyTorch model): weights will be copied from
            target_model (PyTorch model): weights will be copied to
            tau (float): interpolation parameter 
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)

6.2 Training DQN + PER 🤖¶

Metrics Legend:

  • Avg Episode Length (SMA100): Average Episode Time Frame For The Past 100 Episodes
  • Current Score: Current Reward For The Given Episode
  • Landing Rate: Estimated % Of Agent Landing For The Past 100 Episodes (Reward >= 120)
  • Success Rate: Estimated % Of Agent Landing In The Center For The Past 100 Episodes (Reward >= 200)

  • Back to content table
In [62]:
# Reset seed - Use same seed for all experiments (objective comparison)
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)

# DQN + PTR
agent = PTRAgent(8, 4, hidden_dim=64, network=QNetwork, 0.0005, 0.0000001)
results_DQN_PTR = train_agent(display_every=200, max_t=1000, video_filepath='DQN_PTR')
C:\Users\quahj\AppData\Local\Temp\ipykernel_11752\3117275958.py:18: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  self.e = np.zeros((buffer_size, 1), dtype=np.float)
  0%|          | 0/3000 [00:00<?, ?it/s]
Episode 200	Avg Score (SMA100): -210.796 Current Score: -184
Avg Episode Length (SMA100): 136.32 Current Episode Length: 206.000
Landing Rate: 0% | Success Rate: 0%

Episode 400	Avg Score (SMA100): -105.701 Current Score: -270
Avg Episode Length (SMA100): 445.82 Current Episode Length: 747.000
Landing Rate: 2% | Success Rate: 0%

Moviepy - Building video video/DQN_PTR.mp4.
Moviepy - Writing video video/DQN_PTR.mp4

t:   0%|          | 0/912 [00:00<?, ?it/s, now=None]
t:   4%|▍         | 38/912 [00:00<00:02, 376.63it/s, now=None]
t:  13%|█▎        | 121/912 [00:00<00:01, 640.01it/s, now=None]
t:  24%|██▍       | 217/912 [00:00<00:00, 782.22it/s, now=None]
t:  33%|███▎      | 299/912 [00:00<00:00, 791.88it/s, now=None]
t:  42%|████▏     | 379/912 [00:00<00:00, 769.34it/s, now=None]
t:  50%|█████     | 459/912 [00:00<00:00, 779.45it/s, now=None]
t:  59%|█████▉    | 538/912 [00:00<00:00, 775.83it/s, now=None]
t:  68%|██████▊   | 621/912 [00:00<00:00, 790.60it/s, now=None]
t:  78%|███████▊  | 713/912 [00:00<00:00, 829.19it/s, now=None]
t:  88%|████████▊ | 803/912 [00:01<00:00, 848.91it/s, now=None]
t:  99%|█████████▉| 903/912 [00:01<00:00, 893.91it/s, now=None]
                                                               
Moviepy - Done !
Moviepy - video ready video/DQN_PTR.mp4
Episode 600	Avg Score (SMA100): -96.389 Current Score: -34
Avg Episode Length (SMA100): 511.77 Current Episode Length: 999.000
Landing Rate: 23% | Success Rate: 8%

Episode 800	Avg Score (SMA100): -95.798 Current Score: -184
Avg Episode Length (SMA100): 621.54 Current Episode Length: 558.000
Landing Rate: 18% | Success Rate: 10%

Moviepy - Building video video/DQN_PTR.mp4.
Moviepy - Writing video video/DQN_PTR.mp4

t:   0%|          | 0/1000 [00:00<?, ?it/s, now=None]
t:   6%|▌         | 60/1000 [00:00<00:01, 598.05it/s, now=None]
t:  18%|█▊        | 178/1000 [00:00<00:00, 937.84it/s, now=None]
t:  30%|███       | 302/1000 [00:00<00:00, 1071.69it/s, now=None]
t:  42%|████▏     | 420/1000 [00:00<00:00, 1114.21it/s, now=None]
t:  54%|█████▍    | 541/1000 [00:00<00:00, 1146.66it/s, now=None]
t:  66%|██████▌   | 656/1000 [00:00<00:00, 1129.18it/s, now=None]
t:  77%|███████▋  | 773/1000 [00:00<00:00, 1141.99it/s, now=None]
t:  89%|████████▉ | 889/1000 [00:00<00:00, 1141.97it/s, now=None]
                                                                 
Moviepy - Done !
Moviepy - video ready video/DQN_PTR.mp4
Episode 1000	Avg Score (SMA100): -51.242 Current Score: -118
Avg Episode Length (SMA100): 835.1 Current Episode Length: 999.000
Landing Rate: 23% | Success Rate: 13%

Episode 1200	Avg Score (SMA100): 81.103 Current Score: 54
Avg Episode Length (SMA100): 457.67 Current Episode Length: 130.000
Landing Rate: 56% | Success Rate: 32%

Moviepy - Building video video/DQN_PTR.mp4.
Moviepy - Writing video video/DQN_PTR.mp4

t:   0%|          | 0/436 [00:00<?, ?it/s, now=None]
t:  11%|█         | 47/436 [00:00<00:00, 465.57it/s, now=None]
t:  34%|███▍      | 150/436 [00:00<00:00, 791.83it/s, now=None]
t:  62%|██████▏   | 269/436 [00:00<00:00, 971.22it/s, now=None]
t:  89%|████████▉ | 389/436 [00:00<00:00, 1060.70it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/DQN_PTR.mp4
Episode 1400	Avg Score (SMA100): 165.125 Current Score: 248
Avg Episode Length (SMA100): 369.12 Current Episode Length: 408.000
Landing Rate: 80% | Success Rate: 70%

Episode 1600	Avg Score (SMA100): 154.341 Current Score: 234
Avg Episode Length (SMA100): 345.35 Current Episode Length: 382.000
Landing Rate: 74% | Success Rate: 66%

Moviepy - Building video video/DQN_PTR.mp4.
Moviepy - Writing video video/DQN_PTR.mp4

t:   0%|          | 0/329 [00:00<?, ?it/s, now=None]
t:  18%|█▊        | 59/329 [00:00<00:00, 585.81it/s, now=None]
t:  50%|████▉     | 163/329 [00:00<00:00, 851.35it/s, now=None]
t:  78%|███████▊  | 257/329 [00:00<00:00, 887.86it/s, now=None]
                                                               
Moviepy - Done !
Moviepy - video ready video/DQN_PTR.mp4
Environment solved in 1761 episodes!

Avg Score (SMA100): 200.481 Current Score: 253
Avg Episode Length (SMA100): 357.63 Current Episode Length: 617.000
Landing Rate: 86% | Success Rate: 72%

6.3 Evaluate DQN + PER Peformance🔬¶


  • Back to content table
In [64]:
saveJSON(results_DQN_PTR, 'dict_DQN_PTR.json')
In [90]:
show_video('DQN_PTR')
In [93]:
# Plot result
dict_DQN_PTR = loadJSON('dict_DQN_PTR.json')
sns.set_style("whitegrid")
plotResult(dict_DQN_PTR, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 1761 Episodes
====================================
Final success_rate_SMA100: 72.00
Final landing_rate_SMA100: 86.00
Final scores_SMA100: 200.48

Observation:
Our observation showed that the agent started to learn how to land with both feet around Episode 400 and eventually succeeded in landing within the flags several episodes later. Compared to prior models, this DQN + PER combination appeared to be enhancing both the landing and success rate at a similar rate. The variance between these two rates throughout the episodes was much lower, indicating that the agent is efficiently and effectively learning both simultaneously.

Also, this network was able to solve the environment in the shortest number of episodes so far (1761).

7.0 Hyperparameter Tuning 🤓¶

As our DQN + PER managed to solve the environment in the least number of episodes out of all the models we tried, we decided to optimize its hyperparameter.

  • Candidate model -> DQN + PER
  • Metrics -> Lowest epochs needed to solve the environment

Parameters to tune:

  • Discount Factor (gamma)
  • Max_t -> Maximum time frame before terminating episode
  • Learning Rate
  • Weight Decay (Noise)
  • Complexity (Hidden Layers)

*Note that there are many more possible hyperparameters to tune. Ultimately, our choice came down to the 5 hyperparameters that we think would play quite a big role in making our agent train better.


  • Back to content table
In [231]:
def NetRandomTuner(LR_range = np.logspace(1,2,num=4)/100000 * 0.5,max_alive=[600,800,1000],
                   model_layers=[32,64,128,160],discount_factor=[0.985,0.9875,0.99,0.9925,0.995], 
                   weight_decay=np.logspace(1,4,num=4)/1000000000,trials=10):
    global GAMMA
    possible_trials=[]
    for LR, MA, DF, ML, WD in itertools.product(*(LR_range,max_alive,discount_factor, model_layers,weight_decay)):
        possible_trials.append([LR, MA, DF, ML, WD])
    # shuffle all possible trials
    random.shuffle(possible_trials)
    epoch_hist = []
    trial_hist = []
    best_parms = [99999]
    trial_count = 0
    t0 = time.time()
    
    for trial in possible_trials:
        print('Next trial: ',trial)
        t1 = time.time()
        if trial_count == trials:
            print(f'\n\nTrial ended at trial #{trial_count}')
            break
        trial_count += 1

        np.random.seed(0)
        env = gym.make('LunarLander-v2',enable_wind=True)
        env.reset(seed = 0)
        
        GAMMA = trial[2]
        
        agent = PTRAgent(8, 4, hidden_dim=trial[3], LR=trial[0], weight_decay=trial[4], network=QNetwork)
        results_DQN_PTR = train_agent(display_every=200, max_t=trial[1], n_episodes=1500)

        if len(results_DQN_PTR['scores']) < best_parms[0]:
            checkpoint_best = deepcopy(agent)
            best_parms=[len(results_DQN_PTR['scores']),trial[0],trial[1],trial[2],trial[3],trial[4],trial_count]
            best_results = results_DQN_PTR
        clear_output()
        print(f'''
Trial #{trial_count} Finished - Search Time {(time.time()-t1)/60:.2f} Mins
Total Time Elapsed: {(time.time()-t0)/60:.2f} Mins\n
Hyperparameters\t\t|Trial Values: #{trial_count}\t|Best Trial Values: #{best_parms[-1]}\n
Learning Rate\t\t|{trial[0]:.7f}\t\t|{best_parms[1]:.7f}
Max Time Alive\t\t|{trial[1]:.0f}\t\t\t|{best_parms[2]:.0f}
Discount Factor\t\t|{trial[2]:.4f}\t\t\t|{best_parms[3]:.4f}
Model Layers\t\t|{trial[3]:.0f}\t\t\t|{best_parms[4]:.0f}
Weight Decay\t\t|{trial[4]:.8f}\t\t|{best_parms[5]:.8f}
Epochs To Solve\t\t|{len(results_DQN_PTR['scores'])}\t\t\t|{best_parms[0]}\n\n
''')
    return best_results, checkpoint_best

7.1 Running Hyperparameter Tuner 🤖¶


  • Back to content table
In [232]:
best_results, best_agent = NetRandomTuner(trials=30)
Trial #30 Finished - Search Time 7.85 Mins
Total Time Elapsed: 611.33 Mins

Hyperparameters		|Trial Values: #30	|Best Trial Values: #29

Learning Rate		|0.0005000		|0.0001077
Max Time Alive		|600			|800
Discount Factor		|0.9925			|0.9875
Model Layers		|32			|32
Weight Decay		|0.00000010		|0.00000100
Epoch To Solve		|915			|840



Next trial:  [0.00010772173450159416, 1000, 0.985, 64, 1e-08]


Trial ended at trial #30
In [247]:
saveJSON(best_results, 'DQN_PER_tuned.json')
In [352]:
# torch.save(best_agent.qnetwork_online.state_dict(), f'best_hypertuned_DQN_PER.pth')
torch.save(best_agent.qnetwork_target.state_dict(), f'best_hypertuned_DQN_PER_target.pth')

7.2 Evaluate Hyperparameter Tuned DQN + PER 🔬¶


  • Back to content table
In [248]:
# Plot result
DQN_PER = loadJSON('DQN_PER_tuned.json')
sns.set_style("whitegrid")
plotResult(DQN_PER, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 840 Episodes
====================================
Final success_rate_SMA100: 71.00
Final landing_rate_SMA100: 97.00
Final scores_SMA100: 200.67

Observation:
After hyperparameter tuning, our model was able to solve the environment in only 840 Episodes, which is a significant improvement from before.

Our observation showed that the agent started to learn how to land with both feet and solve the environment at around Episode 200. Compared to our DQN + PER before hyperparameter tuning, this tuned model is able to increase its landing rate at a very stable pace. Meaning that the model is learning to land very effectively. Ultimately, this model reached a landing rate (SMA100) of 97% and sucess rate (SMA 100) of 71% in the end.

Note: Different seeds can result in different levels of difficulty for the agent to land the lunar module within the flags. Some seeds may result in an environment that is easier for the agent to learn, while others may result in a more challenging environment. A possible reason for this drastic improvement in result can be partially due to the environment seed.

8.0 Final Evaluation - Objectively Testing¶

  • How to be more objective when it comes to evaluation besides using the metrics Lowest Epochs To Solve Environment?

Our approach:¶

  • Train our models on the same training environment (eg. env seed 1)
  • Test all models on the same testing environment (eg. env seed 2)
  • All models will only have the chance to be trained for 1000 epochs in the same trianing environment.
  • All models will go through 500 episodes of testing in the same testing environment and metrics will be recorded.
  • Scores will then be able to be more objectively comparable since all models are trained the in same training environment and tested with the same testing environment. The only different being the Network used to train our agent.

It is noteworth to point out that there is many other ways to objectively evaluate reinforcement learning depending on the user's needs, for example, time to solve environment could also be used as a metric when it comes to evaluation.


  • Back to content table
In [10]:
def train_agent_1k(max_t: int=1000, eps_start: float=1.0, 
        eps_end: float=0.01, eps_decay: float=0.995, display_every: int=150, model_name='DQN',
        video_filepath='LunarLander_training'):
    '''
    Train a Network agent
    
    Parameters:
        n_episodes (int): Maximum number of episodes for training
        max_t (int): Maximum number of timesteps per episode
        eps_start (float): Initial value of epsilon for epsilon-greedy action selection
        eps_end (float): Minimum value of epsilon
        eps_decay (float): Factor to decrease epsilon per episode
    '''
    shown = False
    scores = []                        # list containing scores from each episode
    scores_SMA100 = []
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = eps_start                    # initialize epsilon
    time_taken = []
    time_taken_window = deque(maxlen=100)
    success_rate = deque(maxlen=100)
    landing_rate = deque(maxlen=100)
    success_rate_SMA100 = []
    landing_rate_SMA100 = []
    for i_episode in tqdm(range(1, 1001)):
        state = env.reset()[0]
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
            next_state, reward, done, _, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                if score >= 200:
                    success_rate.append(1)
                    landing_rate.append(1)
                elif score >= 120:
                    success_rate.append(0)
                    landing_rate.append(1)
                else:
                    success_rate.append(0)
                    landing_rate.append(0)
                break
        scores_window.append(score)       # save most recent score
        scores_SMA100.append(np.mean(scores_window))
        scores.append(score)              # save most recent score
        time_taken.append(t)
        time_taken_window.append(t)
        landing_rate_SMA100.append(landing_rate.count(1))
        success_rate_SMA100.append(success_rate.count(1))
        eps = max(eps_end, eps_decay*eps) # decrease epsilon
        if i_episode % display_every == 0:
            # SMA100: Average of past 100 period (Simple Moving Average)
            print(f'\rEpisode {i_episode}\tAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.3f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
            torch.save(agent.qnetwork_online.state_dict(), f'.\models\{model_name+str(i_episode)}_train.pth')
        if i_episode % (display_every*2) == 0:
            save_video(agent, video_filepath, f'{model_name+str(i_episode)}_train.pth', seed = i_episode)
        elif i_episode % (display_every*2+1) == 0:
            show_video(video_filepath, 200)
        if np.mean(scores_window)>=200.0 and not shown:
            shown = True
            print('\nEnvironment solved in {:d} episodes!'.format(i_episode))
            print(f'\rAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.3f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
    
    torch.save(agent.qnetwork_online.state_dict(), f'.\models\{model_name}_best.pth')
    
    return {
        'scores': scores, 'scores_SMA100': scores_SMA100,
        'scores_window': scores_window, 'time_taken': time_taken,
        'time_taken_window': time_taken_window, 'success_rate': success_rate, 
        'landing_rate': landing_rate, 'landing_rate_SMA100': landing_rate_SMA100,
        'success_rate_SMA100': success_rate_SMA100, 'eps':eps
    }
In [11]:
def train_policy_1k(env,policy, optimizer, 
                 discount_factor=0.99, ppo_steps=5, ppo_clip=0.2,
                 max_t=1000, model_name='PPO_ActorCritic',display_every=100):
    # Put model to train
    policy.train()
    # Metris variables
    shown = False
    scores = []                        # list containing scores from each episode
    scores_SMA100 = []
    scores_window = deque(maxlen=100)  # last 100 scores
    time_taken = []
    time_taken_window = deque(maxlen=100)
    success_rate = deque(maxlen=100)
    landing_rate = deque(maxlen=100)
    success_rate_SMA100 = []
    landing_rate_SMA100 = []
    for i_episode in tqdm(range(1, 1001)):
        score = 0
        state = env.reset()[0]
        
        # Policy variables
        states = []
        actions = []
        log_prob_actions = []
        values = []
        rewards = []
        for t in range(max_t):
            state = torch.FloatTensor(state).unsqueeze(0)

            #append state here, not after we get the next state from env.step()
            states.append(state)
            action_pred, value_pred = policy(state)
            action_prob = F.softmax(action_pred, dim = -1)
            dist = distributions.Categorical(action_prob)
            action = dist.sample()
            log_prob_action = dist.log_prob(action)
            state, reward, done, _, _ = env.step(action.item())

            actions.append(action)
            log_prob_actions.append(log_prob_action)
            values.append(value_pred)
            rewards.append(reward)

            score += reward
            if done:
                if score >= 200:
                    success_rate.append(1)
                    landing_rate.append(1)
                elif score >= 120:
                    success_rate.append(0)
                    landing_rate.append(1)
                else:
                    success_rate.append(0)
                    landing_rate.append(0)
                break
        ### Record Metrics ###
        scores_window.append(score)       # save most recent score
        scores_SMA100.append(np.mean(scores_window))
        scores.append(score)              # save most recent score
        time_taken.append(t)
        time_taken_window.append(t)
        landing_rate_SMA100.append(landing_rate.count(1))
        success_rate_SMA100.append(success_rate.count(1))

        states = torch.cat(states)
        actions = torch.cat(actions)    
        log_prob_actions = torch.cat(log_prob_actions)
        values = torch.cat(values).squeeze(-1)
        returns = calculate_returns(rewards, discount_factor)
        advantages = calculate_advantages(returns, values)
        update_policy(policy, states, actions, log_prob_actions, advantages, returns, optimizer, ppo_steps, ppo_clip)
        
        if i_episode % display_every == 0:
            # SMA100: Average of past 100 period (Simple Moving Average)
            print(f'\rEpisode {i_episode}\tAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.0f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
            torch.save(policy.state_dict(), f'.\models\{model_name+str(i_episode)}_train.pth')
        if i_episode % (display_every*2) == 0:
            save_video_PPO(policy, 'PPO', f'{model_name+str(i_episode)}_train.pth',max_t=max_t, seed = i_episode)
        elif i_episode % (display_every*2+1) == 0:
            show_video('PPO', 200)
        if np.mean(scores_window)>=200.0 and not shown:
            shown = True
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
    
    torch.save(policy.state_dict(), f'.\models\{model_name}_best.pth')
    
    return {
        'scores': scores, 'scores_SMA100': scores_SMA100,
        'scores_window': scores_window, 'time_taken': time_taken,
        'time_taken_window': time_taken_window, 'success_rate': success_rate, 
        'landing_rate': landing_rate, 'landing_rate_SMA100': landing_rate_SMA100,
        'success_rate_SMA100': success_rate_SMA100
    }
In [91]:
def test_agent(max_t: int=1000, 
        eps: float=0.01, display_every: int=150, model_ckpt = 'DQN_1K_best.pth', agent=None):
    '''
    Train a Network agent
    
    Parameters:
        max_t (int): Maximum number of timesteps per episode
        eps: Will be set to 0.01 as it is the final epsilon returned after training
    '''
    scores = []                        # list containing scores from each episode
    scores_SMA100 = []
    scores_window = deque(maxlen=100)  # last 100 scores
    time_taken = []
    time_taken_window = deque(maxlen=100)
    success_rate = deque(maxlen=100)
    landing_rate = deque(maxlen=100)
    success_rate_SMA100 = []
    landing_rate_SMA100 = []
    
    agent.qnetwork_online.load_state_dict(torch.load('./models/' + model_ckpt))
    agent.qnetwork_online.eval()
    
    for i_episode in tqdm(range(1, 501)):
        state = env.reset()[0]
        score = 0
        for t in range(max_t):
            state = torch.from_numpy(state).float().unsqueeze(0).to(device)
            with torch.no_grad():
                action_values = agent.qnetwork_online(state)
            if random.random() > eps:
                action = np.argmax(action_values.cpu().data.numpy())
            else:
                action = random.choice(np.arange(4))
            next_state, reward, done, _, _ = env.step(action)
            state = next_state
            score += reward
            if done:
                if score >= 200:
                    success_rate.append(1)
                    landing_rate.append(1)
                elif score >= 120:
                    success_rate.append(0)
                    landing_rate.append(1)
                else:
                    success_rate.append(0)
                    landing_rate.append(0)
                break
        scores_window.append(score)       # save most recent score
        scores_SMA100.append(np.mean(scores_window))
        scores.append(score)              # save most recent score
        time_taken.append(t)
        time_taken_window.append(t)
        landing_rate_SMA100.append(landing_rate.count(1))
        success_rate_SMA100.append(success_rate.count(1))
        if i_episode % display_every == 0:
            # SMA100: Average of past 100 period (Simple Moving Average)
            print(f'\rEpisode {i_episode}\tAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.3f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
    
    return {
        'scores': scores, 'scores_SMA100': scores_SMA100,
        'scores_window': scores_window, 'time_taken': time_taken,
        'time_taken_window': time_taken_window, 'success_rate': success_rate, 
        'landing_rate': landing_rate, 'landing_rate_SMA100': landing_rate_SMA100,
        'success_rate_SMA100': success_rate_SMA100
    }
In [13]:
def test_policy(env, policy, max_t=1000,
                display_every=100, model_ckpt=''):
    
    # Load saved weights
    policy.load_state_dict(torch.load('./models/' + model_ckpt + '.pth'))
    
    # Put model to eval
    policy.eval()
    # Metrics variables
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    time_taken = []
    time_taken_window = deque(maxlen=100)
    success_rate = deque(maxlen=100)
    landing_rate = deque(maxlen=100)
    success_rate_SMA100 = []
    landing_rate_SMA100 = []
    scores_SMA100 = []
    for i_episode in tqdm(range(1, 500+1)):
        score = 0
        state = env.reset()[0]
        for t in range(max_t):
            state = torch.FloatTensor(state).unsqueeze(0)
            action_pred, _ = policy(state)
            action_prob = F.softmax(action_pred, dim = -1)
            dist = distributions.Categorical(action_prob)
            action = dist.sample()
            state, reward, done, _, _ = env.step(action.item())
            score += reward
            if done:
                if score >= 200:
                    success_rate.append(1)
                    landing_rate.append(1)
                elif score >= 120:
                    success_rate.append(0)
                    landing_rate.append(1)
                else:
                    success_rate.append(0)
                    landing_rate.append(0)
                break
        # Record metrics
        scores_window.append(score)       # save most recent score
        scores_SMA100.append(np.mean(scores_window))
        scores.append(score)              # save most recent score
        time_taken.append(t)
        time_taken_window.append(t)
        landing_rate_SMA100.append(landing_rate.count(1))
        success_rate_SMA100.append(success_rate.count(1))
        if i_episode % display_every == 0:
            # SMA100: Average of past 100 period (Simple Moving Average)
            print(f'\rEpisode {i_episode}\tAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.0f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
    return {
        'scores': scores, 'scores_SMA100': scores_SMA100,
        'scores_window': scores_window, 'time_taken': time_taken,
        'time_taken_window': time_taken_window, 'success_rate': success_rate, 
        'landing_rate': landing_rate, 'landing_rate_SMA100': landing_rate_SMA100,
        'success_rate_SMA100': success_rate_SMA100
    }

8.1 Training All Models (1000 Episodes)¶

Trained models (All 1000 episodes)

  • DQN
  • DDQN
  • Actor & Critic /w PPO
  • DQN + PER (Hyperparameter tuned)

  • Back to content table

DQN - Training environment (seeds 0, 1)

In [337]:
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 1)

agent = Agent(8, 4, hidden_dim=64, network=QNetwork)
DQN_trained_1k = train_agent_1k(display_every=200, max_t=1000, model_name='DQN_1K')
DQN_trained_1k['eps']
  0%|          | 0/1000 [00:00<?, ?it/s]
Episode 200	Avg Score (SMA100): -186.146 Current Score: -160
Avg Episode Length (SMA100): 148.93 Current Episode Length: 168.000
Landing Rate: 0% | Success Rate: 0%

Episode 400	Avg Score (SMA100): -74.060 Current Score: -17
Avg Episode Length (SMA100): 684.3 Current Episode Length: 999.000
Landing Rate: 3% | Success Rate: 1%

Moviepy - Building video video/LunarLander_training.mp4.
Moviepy - Writing video video/LunarLander_training.mp4

t:   0%|          | 0/418 [00:00<?, ?it/s, now=None]
t:  13%|█▎        | 53/418 [00:00<00:00, 526.39it/s, now=None]
t:  43%|████▎     | 179/418 [00:00<00:00, 952.50it/s, now=None]
t:  72%|███████▏  | 303/418 [00:00<00:00, 1080.26it/s, now=None]
t: 100%|██████████| 418/418 [00:00<00:00, 1105.74it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/LunarLander_training.mp4
Episode 600	Avg Score (SMA100): -46.550 Current Score: -182
Avg Episode Length (SMA100): 471.64 Current Episode Length: 118.000
Landing Rate: 30% | Success Rate: 7%

Episode 800	Avg Score (SMA100): -70.309 Current Score: -82
Avg Episode Length (SMA100): 934.42 Current Episode Length: 999.000
Landing Rate: 23% | Success Rate: 5%

Moviepy - Building video video/LunarLander_training.mp4.
Moviepy - Writing video video/LunarLander_training.mp4

t:   0%|          | 0/1000 [00:00<?, ?it/s, now=None]
t:   5%|▍         | 47/1000 [00:00<00:02, 467.45it/s, now=None]
t:  14%|█▍        | 139/1000 [00:00<00:01, 732.10it/s, now=None]
t:  24%|██▍       | 244/1000 [00:00<00:00, 875.23it/s, now=None]
t:  34%|███▍      | 344/1000 [00:00<00:00, 922.94it/s, now=None]
t:  44%|████▎     | 437/1000 [00:00<00:00, 911.00it/s, now=None]
t:  54%|█████▍    | 540/1000 [00:00<00:00, 949.98it/s, now=None]
t:  64%|██████▍   | 643/1000 [00:00<00:00, 975.44it/s, now=None]
t:  74%|███████▍  | 741/1000 [00:00<00:00, 970.01it/s, now=None]
t:  84%|████████▍ | 839/1000 [00:00<00:00, 970.93it/s, now=None]
t:  96%|█████████▌| 956/1000 [00:01<00:00, 1029.17it/s, now=None]
                                                                 
Moviepy - Done !
Moviepy - video ready video/LunarLander_training.mp4
Episode 1000	Avg Score (SMA100): 129.990 Current Score: 26
Avg Episode Length (SMA100): 542.53 Current Episode Length: 999.000
Landing Rate: 80% | Success Rate: 47%

Out[337]:
0.01
In [344]:
saveJSON(DQN_trained_1k,'DQN_trained_1k.json')

DDQN - Training Environment (seeds 0, 1)

  • Notice that the environment used in the videos are consistent
In [338]:
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 1)

agent = Agent(8, 4, hidden_dim=32, network=DDQN)
DDQN_trained_1k = train_agent_1k(display_every=200, max_t=1000, model_name='DDQN_1K')
DDQN_trained_1k['eps']
  0%|          | 0/1000 [00:00<?, ?it/s]
Episode 200	Avg Score (SMA100): -330.463 Current Score: -186
Avg Episode Length (SMA100): 109.86 Current Episode Length: 117.000
Landing Rate: 0% | Success Rate: 0%

Episode 400	Avg Score (SMA100): -138.476 Current Score: -39
Avg Episode Length (SMA100): 560.67 Current Episode Length: 999.000
Landing Rate: 0% | Success Rate: 0%

Moviepy - Building video video/LunarLander_training.mp4.
Moviepy - Writing video video/LunarLander_training.mp4

t:   0%|          | 0/1000 [00:00<?, ?it/s, now=None]
t:   5%|▌         | 52/1000 [00:00<00:01, 516.24it/s, now=None]
t:  17%|█▋        | 172/1000 [00:00<00:00, 914.58it/s, now=None]
t:  28%|██▊       | 284/1000 [00:00<00:00, 1005.71it/s, now=None]
t:  39%|███▉      | 394/1000 [00:00<00:00, 1039.37it/s, now=None]
t:  51%|█████▏    | 513/1000 [00:00<00:00, 1089.85it/s, now=None]
t:  64%|██████▍   | 638/1000 [00:00<00:00, 1140.52it/s, now=None]
t:  76%|███████▌  | 756/1000 [00:00<00:00, 1151.25it/s, now=None]
t:  88%|████████▊ | 878/1000 [00:00<00:00, 1170.21it/s, now=None]
                                                                 
Moviepy - Done !
Moviepy - video ready video/LunarLander_training.mp4
Episode 600	Avg Score (SMA100): -147.844 Current Score: -144
Avg Episode Length (SMA100): 336.39 Current Episode Length: 147.000
Landing Rate: 1% | Success Rate: 0%

Episode 800	Avg Score (SMA100): -108.500 Current Score: -57
Avg Episode Length (SMA100): 440.15 Current Episode Length: 999.000
Landing Rate: 5% | Success Rate: 0%

Moviepy - Building video video/LunarLander_training.mp4.
Moviepy - Writing video video/LunarLander_training.mp4

t:   0%|          | 0/193 [00:00<?, ?it/s, now=None]
t:  28%|██▊       | 54/193 [00:00<00:00, 537.78it/s, now=None]
t:  83%|████████▎ | 161/193 [00:00<00:00, 847.69it/s, now=None]
                                                               
Moviepy - Done !
Moviepy - video ready video/LunarLander_training.mp4
Episode 1000	Avg Score (SMA100): -169.918 Current Score: -28
Avg Episode Length (SMA100): 611.61 Current Episode Length: 999.000
Landing Rate: 0% | Success Rate: 0%

Out[338]:
0.01

PPO - Training Environment

In [346]:
actor = MLP(8, 128, 4).to(device)
critic = MLP(8, 128, 1).to(device)
policy = ActorCritic(actor, critic)
policy.apply(init_weights)
optimizer = optim.Adam(policy.parameters(), lr = 0.0005)

np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 1)

PPO_trained_1k = train_policy_1k(env, policy, optimizer, 0.99, 5, 0.2, 1000,'PPO_trained_1k',200)
  0%|          | 0/1000 [00:00<?, ?it/s]
Episode 200	Avg Score (SMA100): -147.075 Current Score: -71
Avg Episode Length (SMA100): 110.81 Current Episode Length: 194
Landing Rate: 0% | Success Rate: 0%

Episode 400	Avg Score (SMA100): -108.386 Current Score: -82
Avg Episode Length (SMA100): 616.46 Current Episode Length: 706
Landing Rate: 0% | Success Rate: 0%

Moviepy - Building video video/PPO.mp4.
Moviepy - Writing video video/PPO.mp4

t:   0%|          | 0/1000 [00:00<?, ?it/s, now=None]
t:   5%|▍         | 48/1000 [00:00<00:02, 473.10it/s, now=None]
t:  17%|█▋        | 169/1000 [00:00<00:00, 899.61it/s, now=None]
t:  30%|██▉       | 299/1000 [00:00<00:00, 1078.22it/s, now=None]
t:  43%|████▎     | 426/1000 [00:00<00:00, 1152.33it/s, now=None]
t:  55%|█████▌    | 553/1000 [00:00<00:00, 1189.53it/s, now=None]
t:  68%|██████▊   | 677/1000 [00:00<00:00, 1205.76it/s, now=None]
t:  81%|████████  | 808/1000 [00:00<00:00, 1238.49it/s, now=None]
t:  94%|█████████▍| 939/1000 [00:00<00:00, 1257.56it/s, now=None]
                                                                 
Moviepy - Done !
Moviepy - video ready video/PPO.mp4
Episode 600	Avg Score (SMA100): -12.633 Current Score: -10
Avg Episode Length (SMA100): 656.43 Current Episode Length: 328
Landing Rate: 11% | Success Rate: 2%

Episode 800	Avg Score (SMA100): 40.668 Current Score: -77
Avg Episode Length (SMA100): 757.44 Current Episode Length: 503
Landing Rate: 28% | Success Rate: 5%

Moviepy - Building video video/PPO.mp4.
Moviepy - Writing video video/PPO.mp4

t:   0%|          | 0/556 [00:00<?, ?it/s, now=None]
t:  10%|█         | 56/556 [00:00<00:00, 557.95it/s, now=None]
t:  32%|███▏      | 179/556 [00:00<00:00, 952.15it/s, now=None]
t:  54%|█████▍    | 302/556 [00:00<00:00, 1075.64it/s, now=None]
t:  76%|███████▋  | 424/556 [00:00<00:00, 1132.46it/s, now=None]
t:  99%|█████████▉| 552/556 [00:00<00:00, 1185.37it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/PPO.mp4
Episode 1000	Avg Score (SMA100): 45.145 Current Score: 86
Avg Episode Length (SMA100): 683.03 Current Episode Length: 785
Landing Rate: 24% | Success Rate: 6%

DQN + PER - Training Environment

In [358]:
GAMMA = 0.9875

np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 1)


agent = PTRAgent(8, 4, hidden_dim=32, LR=0.0001077, weight_decay=0.000001, network=QNetwork)
DQN_PER_trained_1k = train_agent_1k(display_every=200, max_t=800,model_name='DQN_PER_1k')
DQN_PER_trained_1k['eps']
  0%|          | 0/1000 [00:00<?, ?it/s]
Episode 200	Avg Score (SMA100): -111.753 Current Score: -204
Avg Episode Length (SMA100): 290.27 Current Episode Length: 534.000
Landing Rate: 0% | Success Rate: 0%

Episode 400	Avg Score (SMA100): -74.284 Current Score: -39
Avg Episode Length (SMA100): 632.84 Current Episode Length: 799.000
Landing Rate: 0% | Success Rate: 0%

Moviepy - Building video video/LunarLander_training.mp4.
Moviepy - Writing video video/LunarLander_training.mp4

t:   0%|          | 0/896 [00:00<?, ?it/s, now=None]
t:   6%|▌         | 54/896 [00:00<00:01, 537.00it/s, now=None]
t:  21%|██        | 186/896 [00:00<00:00, 993.75it/s, now=None]
t:  35%|███▌      | 317/896 [00:00<00:00, 1133.04it/s, now=None]
t:  50%|█████     | 449/896 [00:00<00:00, 1204.96it/s, now=None]
t:  64%|██████▍   | 573/896 [00:00<00:00, 1216.00it/s, now=None]
t:  79%|███████▊  | 704/896 [00:00<00:00, 1247.39it/s, now=None]
t:  93%|█████████▎| 834/896 [00:00<00:00, 1263.62it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/LunarLander_training.mp4
Episode 600	Avg Score (SMA100): 22.136 Current Score: 134
Avg Episode Length (SMA100): 695.06 Current Episode Length: 799.000
Landing Rate: 24% | Success Rate: 10%

Episode 800	Avg Score (SMA100): 119.191 Current Score: 268
Avg Episode Length (SMA100): 632.84 Current Episode Length: 799.000
Landing Rate: 68% | Success Rate: 58%

Moviepy - Building video video/LunarLander_training.mp4.
Moviepy - Writing video video/LunarLander_training.mp4

t:   0%|          | 0/211 [00:00<?, ?it/s, now=None]
t:  30%|███       | 64/211 [00:00<00:00, 637.27it/s, now=None]
t:  94%|█████████▍| 198/211 [00:00<00:00, 1045.46it/s, now=None]
                                                                
Moviepy - Done !
Moviepy - video ready video/LunarLander_training.mp4
Episode 1000	Avg Score (SMA100): 187.603 Current Score: 160
Avg Episode Length (SMA100): 391.93 Current Episode Length: 451.000
Landing Rate: 86% | Success Rate: 71%

Moviepy - Done !
Moviepy - video ready video/LunarLander_training.mp4
Out[358]:
0.01

8.2 Testing All Models (500 Episodes) 🔬¶

Metrics Recorded (SMA100)

  • Landing Rate
  • Success Rate
  • Scores

  • Back to content table

DQN - Testing Environment

In [42]:
np.random.seed(2)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 2)

agent_dqn = Agent(8, 4, hidden_dim=64, network=QNetwork)

DQN_test_results = test_agent(1000, model_ckpt = 'DQN_1K_best.pth', display_every=100,agent = agent_dqn)
  0%|          | 0/500 [00:00<?, ?it/s]
Episode 100	Avg Score (SMA100): 113.749 Current Score: 134
Avg Episode Length (SMA100): 459.14 Current Episode Length: 756.000
Landing Rate: 61% | Success Rate: 49%

Episode 200	Avg Score (SMA100): 103.495 Current Score: 226
Avg Episode Length (SMA100): 457.06 Current Episode Length: 609.000
Landing Rate: 68% | Success Rate: 44%

Episode 300	Avg Score (SMA100): 123.541 Current Score: 242
Avg Episode Length (SMA100): 492.69 Current Episode Length: 650.000
Landing Rate: 70% | Success Rate: 50%

Episode 400	Avg Score (SMA100): 100.395 Current Score: 197
Avg Episode Length (SMA100): 507.05 Current Episode Length: 634.000
Landing Rate: 68% | Success Rate: 48%

Episode 500	Avg Score (SMA100): 118.841 Current Score: 211
Avg Episode Length (SMA100): 473.13 Current Episode Length: 445.000
Landing Rate: 72% | Success Rate: 53%

In [43]:
saveJSON(DQN_test_results, 'DQN_test.json')

DDQN - Testing Environment

In [23]:
np.random.seed(2)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 2)

agent_ddqn = Agent(8, 4, hidden_dim=32, network=DDQN)

DDQN_test_results = test_agent(1000, model_ckpt = 'DDQN_1K_best.pth', display_every=100, agent=agent_ddqn)
  0%|          | 0/500 [00:00<?, ?it/s]
Episode 100	Avg Score (SMA100): -191.774 Current Score: -257
Avg Episode Length (SMA100): 650.76 Current Episode Length: 812.000
Landing Rate: 0% | Success Rate: 0%

Episode 200	Avg Score (SMA100): -190.667 Current Score: -113
Avg Episode Length (SMA100): 656.98 Current Episode Length: 247.000
Landing Rate: 0% | Success Rate: 0%

Episode 300	Avg Score (SMA100): -178.545 Current Score: -276
Avg Episode Length (SMA100): 631.19 Current Episode Length: 466.000
Landing Rate: 2% | Success Rate: 1%

Episode 400	Avg Score (SMA100): -189.764 Current Score: -255
Avg Episode Length (SMA100): 655.66 Current Episode Length: 423.000
Landing Rate: 1% | Success Rate: 0%

Episode 500	Avg Score (SMA100): -189.797 Current Score: -217
Avg Episode Length (SMA100): 670.35 Current Episode Length: 545.000
Landing Rate: 0% | Success Rate: 0%

In [40]:
saveJSON(DDQN_test_results, 'DDQN_test.json')

PPO - Testing Environment

In [37]:
np.random.seed(2)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 2)

actor = MLP(8, 128, 4).to(device)
critic = MLP(8, 128, 1).to(device)
policy = ActorCritic(actor, critic)
policy.apply(init_weights)

PPO_test_results = test_policy(env, policy, max_t=1000, display_every=100, model_ckpt='PPO_trained_1k1000_train')
  0%|          | 0/500 [00:00<?, ?it/s]
Episode 100	Avg Score (SMA100): -11.810 Current Score: -97
Avg Episode Length (SMA100): 743.14 Current Episode Length: 211
Landing Rate: 16% | Success Rate: 2%

Episode 200	Avg Score (SMA100): 28.305 Current Score: 116
Avg Episode Length (SMA100): 793.0 Current Episode Length: 761
Landing Rate: 26% | Success Rate: 1%

Episode 300	Avg Score (SMA100): 7.690 Current Score: 13
Avg Episode Length (SMA100): 771.94 Current Episode Length: 999
Landing Rate: 25% | Success Rate: 0%

Episode 400	Avg Score (SMA100): -19.904 Current Score: 80
Avg Episode Length (SMA100): 708.16 Current Episode Length: 999
Landing Rate: 20% | Success Rate: 0%

Episode 500	Avg Score (SMA100): -6.674 Current Score: 107
Avg Episode Length (SMA100): 719.03 Current Episode Length: 790
Landing Rate: 14% | Success Rate: 1%

In [41]:
saveJSON(PPO_test_results, 'PPO_test.json')

DQN + PER - Testing Environment

In [46]:
# Reset seed - Use same seed for all experiments (objective comparison)
np.random.seed(2)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 2)

agent_dqn_per = Agent(8, 4, hidden_dim=32, network=QNetwork)

DQN_PER_test_results = test_agent(1000, model_ckpt = 'DQN_PER_1k_best.pth', display_every=100, agent=agent_dqn_per)
  0%|          | 0/500 [00:00<?, ?it/s]
Episode 100	Avg Score (SMA100): 198.433 Current Score: 229
Avg Episode Length (SMA100): 437.23 Current Episode Length: 338.000
Landing Rate: 87% | Success Rate: 69%

Episode 200	Avg Score (SMA100): 188.209 Current Score: -32
Avg Episode Length (SMA100): 429.25 Current Episode Length: 440.000
Landing Rate: 86% | Success Rate: 70%

Episode 300	Avg Score (SMA100): 190.582 Current Score: 45
Avg Episode Length (SMA100): 484.18 Current Episode Length: 167.000
Landing Rate: 90% | Success Rate: 69%

Episode 400	Avg Score (SMA100): 182.821 Current Score: 263
Avg Episode Length (SMA100): 451.55 Current Episode Length: 412.000
Landing Rate: 88% | Success Rate: 73%

Episode 500	Avg Score (SMA100): 175.023 Current Score: 214
Avg Episode Length (SMA100): 449.08 Current Episode Length: 367.000
Landing Rate: 84% | Success Rate: 60%

In [47]:
saveJSON(DQN_PER_test_results, 'DQN_PER_test.json')

8.3 Final Evaluation of Testing Results 🔬¶

  • Plots
  • Videos
  • Conclusion

  • Back to content table

DQN - Test Results

In [48]:
# Plot result
DQN_test = loadJSON('DQN_test.json')
sns.set_style("whitegrid")
plotResult(DQN_test, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 500 Episodes
====================================
Final success_rate_SMA100: 53.00
Final landing_rate_SMA100: 72.00
Final scores_SMA100: 118.84

Observations
In the 500 Episodes trained, the standard DQN achieved an average landing rate (SMA100) of 61.73% and an average success rate (SMA100) of 43.95%. Also, it achieved an average score (SMA100) of 110 in the testing results.

DDQN - Test Results

In [49]:
# Plot result
DDQN_test = loadJSON('DDQN_test.json')
sns.set_style("whitegrid")
plotResult(DDQN_test, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 500 Episodes
====================================
Final success_rate_SMA100: 0.00
Final landing_rate_SMA100: 0.00
Final scores_SMA100: -189.80

Observations
In the 500 Episodes trained, the DDQN algorithm achieved a highest landing rate (SMA100) of 2% and a highest success rate (SMA100) of 1%. Also, it achieved an average score (SMA100) of -188 in the testing results. This set of result is extremely poor compared to the standard DQN.

PPO - Test Results

In [50]:
# Plot result
PPO_test = loadJSON('PPO_test.json')
sns.set_style("whitegrid")
plotResult(PPO_test, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 500 Episodes
====================================
Final success_rate_SMA100: 1.00
Final landing_rate_SMA100: 14.00
Final scores_SMA100: -6.67

Observations
In the 500 Episodes trained, the PPO algorithm achieved an average landing rate (SMA100) of 18% and an average success rate (SMA100) of 0.8%. Also, it achieved an average score (SMA100) of -2.33 in the testing results. Although slightly better than the DDQN, it is still extremely poor compared to the standard DQN.

DQN + PER - Test Results

In [51]:
# Plot result
DQN_PER_test = loadJSON('DQN_PER_test.json')
sns.set_style("whitegrid")
plotResult(DQN_PER_test, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 500 Episodes
====================================
Final success_rate_SMA100: 60.00
Final landing_rate_SMA100: 84.00
Final scores_SMA100: 175.02

Observations
In the 500 Episodes trained, the optimized DQN + PER algorithm achieved an average landing rate (SMA100) of 79.2% and an average success rate (SMA100) of 63%. Also, it achieved an average scores (SMA100) of 190 in the testing results. Out of all the algorithms evaluated, this set of result is by far the best.

Displaying videos for each model.

In [76]:
reward_dqn = save_video(agent_dqn, 'DQN_test', 'DQN_1K_best.pth', 1000, 0)
reward_dqn
Moviepy - Building video video/DQN_test.mp4.
Moviepy - Writing video video/DQN_test.mp4

                                                                
Moviepy - Done !
Moviepy - video ready video/DQN_test.mp4
Out[76]:
163.0976443844678
In [67]:
reward_ddqn = save_video(agent_ddqn, 'DDQN_test', 'DDQN_1K_best.pth', 1000, 0)
reward_ddqn
Moviepy - Building video video/DDQN_test.mp4.
Moviepy - Writing video video/DDQN_test.mp4

                                                                
Moviepy - Done !
Moviepy - video ready video/DDQN_test.mp4

Out[67]:
-184.18056191070494
In [79]:
reward_ppo = save_video_PPO(policy, 'PPO_test', 'PPO_trained_1k1000_train.pth', 1000, seed = 0)
reward_ppo
Moviepy - Building video video/PPO_test.mp4.
Moviepy - Writing video video/PPO_test.mp4

                                                                 
Moviepy - Done !
Moviepy - video ready video/PPO_test.mp4

Out[79]:
72.27011975068002
In [86]:
reward_dqn_per = save_video(agent_dqn_per, 'DQN_PER_test', 'DQN_PER_1k_best.pth', 1000, 0)
reward_dqn_per
Moviepy - Building video video/DQN_PER_test.mp4.
Moviepy - Writing video video/DQN_PER_test.mp4

                                                                
Moviepy - Done !
Moviepy - video ready video/DQN_PER_test.mp4

Out[86]:
305.26111013472416

Embedding mp4 files into html ⚙️

In [89]:
filepaths = ['DQN_test.mp4','DDQN_test.mp4','PPO_test.mp4','DQN_PER_test.mp4']
rewards = [reward_dqn, reward_ddqn, reward_ppo, reward_dqn_per]

grid_html = '''
<style>
.video-grid {{
  display: table;
  width: 100%;
}}
.video-row {{
  display: table-row;
}}
.video-item {{
  display: table-cell;
  width: 50%;
  height: 400px;
  text-align: center;
  vertical-align: middle;
}}
</style>
<div class="video-grid">
{}
</div>
'''

video_html = '''
<div class="video-item">
  <h3>{}</h3>
  <video alt="test" autoplay loop controls width="80%">
    <source src="data:video/mp4;base64,{}" type="video/mp4" />
  </video>
</div>
'''

videos = ''
for i, file_name in enumerate(filepaths):
    mp4 = 'video/{}'.format(file_name)
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    video_item = video_html.format(f'{file_name} | reward: {rewards[i]:.3f}', encoded.decode('ascii'))
    if i % 2 == 0:
        videos += '<div class="video-row">'
    videos += video_item
    if i % 2 == 1:
        videos += '</div>'

display.display(display.HTML(grid_html.format(videos)))

DQN_test.mp4 | reward: 163.098

DDQN_test.mp4 | reward: -184.181

PPO_test.mp4 | reward: 72.270

DQN_PER_test.mp4 | reward: 305.261

In the same environment,
It can be seen that the DDQN failed to land the spaceship.
The PPO managed to land the spaceship but not within the flag.
The DQN managed to land the spaceship within the flag (successfully) in 19 seconds - inefficient.
Lastly, the DQN + PER managed to land the spaceship successfully in only 5 seconds, outperforming the other models.

Conclusion¶

In this project, we successfully improved our reinforcement learning implementation by constructing and testing a few networks avaliable for reinforcement learning. After experimenting with various methods including DQN, DDQN, and Actor-Critic with PPO, we found that the optimal solution was a DQN with a Prioritized Replay Buffer. To optimize our model further, we conducted hyperparameter tuning on five key variables. Furthermore, to objectively evaluate our models, we used the same training environment (seed: 1) for all networks and tested them on a separate testing environment (seed: 2).